Foreword


This  volume  contains  the  papers  accepted  for  presentation  at  the  Second  International  Workshop  on 
Multistrategy  Learning  (briefly,  MSL-93),  held  in  Harpers  Ferry,  WV,  May  26-29,  1993.  The 
workshop was sponsored by the Office of Naval Research and organized by the Center for Artificial
Intelligence at George Mason University.

The  papers  represent  research  on  multistrategy  learning  conducted  at  leading  research  laboratories  in 
the  U.S.  and  other  countries,  such  as  Austria,  Australia,  Belgium,  Germany,  Italy,  Japan, 
Romania, Slovenia, Spain, and the United Kingdom. The presence of participants from so many
countries is an indication of the truly international significance of the research in this area. To help the
reader capture the variety of research directions, papers have been grouped into five categories,
according to their primary theme: general issues, knowledge base refinement, cooperative
integration, multiple computational strategies, and special topics and applications.

Since  multistrategy  learning  is  one  of  the  newest  directions  in  the  study  and  development  of  systems 
with  learning  capabilities,  a  brief  explanation  of  its  aims  may  be  useful  here.  Research  in  this  area 
concerns  the  development  of  learning  systems  that  employ  two  or  more  inferential  and/or 
computational strategies in a learning process. Though initial research had been primarily oriented
toward integrating different inferential strategies (i.e., different types of inference), more recent
research shows a trend to integrate also different computational strategies (i.e., different knowledge
representations and associated processing methods). These Proceedings reflect this trend by
including a section on methods for integrating such multiple computational strategies. Multistrategy
learning  systems  are  of  increasing  research  interest  due  to  their  potentially  significant  advantages 
over  monostrategy  systems.  Such  systems  can  learn  from  a  greater  variety  of  inputs,  with  different 
amounts  of  prior  knowledge,  and  generate  different  kinds  of  knowledge.  Consequently,  they  could 
be  useful  for  a  wide  range  of  practical  problems.  Since  human  learning  is  inherently  multistrategy, 
the research in this area is also of significant importance to the study of human learning, and has
opened new opportunities for a cross-fertilization of the two fields. Multistrategy learning
workshops serve as a forum for researchers to present and discuss their recent research in this new,
rapidly  evolving  and  very  challenging  area. 

We  gratefully  acknowledge  the  support  from  the  Office  of  Naval  Research,  and  express  our  special 
thanks  to  Lt.  Comm.  Robert  Powell,  without  whose  interest  and  encouragement  this  Workshop 
would  not  have  happened. 

We  thank  Dr.  Su-Shing  Chen  and  Dr.  Andrew  Sage  who  have  honored  the  workshop  with  invited 
presentations. 

We  also  thank  the  many  individuals  who  helped  in  the  organization  and  conduct  of  the  workshop. 

The Program Committee members and the auxiliary reviewers provided careful and timely
reviews of the submitted papers. Their assistance was indispensable for ensuring the high
quality of the contributions.

Michael  Hieb,  Nina  Kaull,  and  Janusz  Wnek  were  in  charge  of  the  local  organization.  They 
diligently  directed  and  executed  many  organizational  aspects  of  the  workshop. 

The  research  assistants  of  the  GMU  Center  for  Artificial  Intelligence,  in  particular,  Jerzy  Bala, 
Eric Bloedorn, Tomasz Dybala, Ibrahim Imam, Ken Kaufman, Mark Maloof, Alan Schultz, and
Haleh  Vafaie,  provided  invaluable  help  in  handling  many  technical  and  managerial  details.  They 
are  a  great  and  reliable  team,  whose  help  cannot  be  overstated. 

The  Second  International  Workshop  on  Multistrategy  Learning  was  the  result  of  the  enthusiastic 
work  and  the  contribution  of  all  the  people  mentioned  above.  We  sincerely  thank  everyone  for  their 
help. 


Ryszard  S.  Michalski  and  Gheorghe  Tecuci 
Abstract

Research  on  multistrategy  task-adaptive 
learning  aims  at  integrating  all  basic 
inferential  learning  strategies — learning  by 
deduction,  induction  and  analogy.  The 
implementation  of  such  a  learning  system 
requires  a  knowledge  representation  that 
facilitates  performing  a  multitype  inference  in 
a  seamlessly  integrated  fashion.  This  paper 
presents  an  approach  to  implementing  such 
multitype  inference  based  on  a  novel 
knowledge  representation,  called  Dynamic 
Interlaced Hierarchies (DIH). DIH integrates
ideas from our research on cognitive modeling
of human plausible reasoning, the Inferential
Theory  of  Learning,  and  knowledge 
visualization.  In  DIH,  knowledge  is 
partitioned  into  a  "static"  part  that  represents 
relatively  stable  knowledge,  and  a  "dynamic" 
part  that  represents  knowledge  that  changes 
relatively  frequently.  The  static  part  is 
organized  into  type,  part,  or  precedence 
hierarchies,  while  the  dynamic  part  consists  of 
traces  that  link  nodes  of  different  hierarchies. 
By  modifying  traces  in  different  ways,  the 
system  can  perform  different  knowledge 
transmutations  (patterns  of  inference),  such  as 
generalization,  abstraction,  similization,  and 
their  opposites,  specialization,  concretion  and 
dissimilization, respectively.

Key  words:  multistrategy  learning, 
inferential  theory  of  learning,  knowledge 
transmutation,  generalization,  abstraction, 
similization. 


1.  Introduction 

The  development  of  multistrategy  learning 
systems  requires  a  powerful  and  easily 
modifiable  knowledge  representation  that 
facilitates  multitype  inference.  This  is 
particularly  true  in  the  case  of  multistrategy 
task-adaptive  learning  (MTL)  systems  that 
integrate a whole range of inferential strategies,
such as empirical induction, abduction,
deduction, plausible deduction, abstraction,
and analogy (Michalski, 1990, 1991; Tecuci
and Michalski, 1991; Tecuci, 1993). An MTL
system adapts a strategy or a combination of
strategies to the learning task, defined by the
available input knowledge, the learner’s
background knowledge and the learning goal.
A theoretical framework for the development
of MTL systems has been presented in
(Michalski, 1993).

This paper presents basic ideas underlying a
knowledge representation proposed for the
implementation of an MTL system and its use
for implementing multitype inference. This
representation, called Dynamic Interlaced
Hierarchies (DIH), integrates ideas from our
research on modeling human plausible
inference, the Inferential Theory of Learning
and the visualization of knowledge. DIH
encompasses many different forms of
knowledge (facts, rules, dependencies, etc.)
and facilitates knowledge transmutations,
described in the Inferential Theory of
Learning (ITL) (Michalski, 1993). This paper
shows how DIH supports several basic
patterns of knowledge change (transmutations),
such as generalization, abstraction,
similization, and their opposites, specialization,
concretion and dissimilization,
respectively. These operations are performed
on DIH traces, which correspond to well-formed
predicate logic expressions associated
with a degree of belief.

While  our  previous  work  has  focused  on  the 
visualization  of  attribute-based  representations 
for  empirical  induction  (Wnek  &  Michalski, 
1991),  DIH  allows  the  visualization  of 
structural  (attributional  and  relational) 
representations.  The  underlying  assumption  is 
that  the  syntactic  structure  for  representing 
any  knowledge  should  reflect  as  closely  as 
possible  the  semantic  relationships  among  the 
knowledge components, and facilitate knowledge
modifications that correspond to the most
frequently  performed  inferences.  An  early 
implementation  of  this  idea  was  in  the 
ADVISE  system,  which  used  three  forms  of 
knowledge  representation:  relational  tables, 
networks  and  rules  (Michalski  et  al.,  1986). 

The  DIH  approach  assumes  that  a  large  part  of 
human  conceptual  knowledge  is  organized 
into  various  hierarchies,  primarily  type,  part 
and  precedence  hierarchies  (see  Section  3  for 
an  explanation).  Such  hierarchies  reflect 
frequently  occurring  relationships  among 
knowledge  components,  and  make  it  easy  to 
perform  basic  forms  of  inference. 

The  initial  idea  for  DIH  stems  from  the  core 
theory  of  human  plausible  reasoning  (Collins 
& Michalski, 1989; Boehm-Davis, Dontas &
Michalski, 1990). The theory presents a
formal  representation  of  various  plausible 
inference  patterns  observed  in  human 
reasoning. 

DIH  is  more  fully  described  in  (Hieb  & 
Michalski,  1993). 

2.  Relevant  Research 

The  core  theory  of  Plausible  Reasoning 
presents  a  system  that  formalizes  various 
plausible inference patterns and “merit
parameters” that affect the certainty of these
inferences. This system combines structural
aspects of reasoning (determined by
knowledge structures) with parametric aspects
that represent quantitative belief and other
measures  affecting  the  reasoning  process. 

Various  components  of  the  "Logic  of  Plausible 
Reasoning"  have  been  implemented  in  several 
systems  (Baker,  Burstein  &  Collins,  1987; 
Dontas & Zemankova, 1988; Kelly, 1988).
These implementations used various subsets
of the inferences (“statement transforms”)
described in the core theory to investigate the
parametric aspects of the theory. The implementations
demonstrated how the core theory
of  plausible  reasoning  can  be  applied  to 
various  domains.  DIH  specifies  a  broader  set 
of  knowledge  transmutations  in  a  general  and 
well-defined  knowledge  representation.  These 
transmutations  are  part  of  a  framework  for 
both  reasoning  and  learning. 

The organization of concepts into various
hierarchies was proposed quite early as a
plausible structure for human semantic memory
(Collins & Quillian, 1972). The
WordNet project at Princeton University,
directed by George Miller, concerns the
implementation of an electronic thesaurus
using such a memory structure (Beckwith et
al., 1991). WordNet is a very large lexical
database  with  approximately  50,000  different 
word  forms.  WordNet  divides  the  lexicon  into 
various  categories  including  nouns,  verbs,  and 
modifiers  (adjectives  and  adverbs). 
Significantly,  the  nouns  are  stored  in  topical 
hierarchies  (both  type  and  part  hierarchies), 
lending  support  to  the  DIH  representation. 
However,  while  WordNet  can  be  used  as  a 
source  of  DIH  hierarchies,  it  does  not  provide 
any  inferential  facilities. 

Other  relevant  research  includes  the 
development  of  the  Common  Knowledge 
Representation  Language  (CKRL),  done  as 
part  of  an  ESPRIT  project  (Morik,  Causse  & 
Boswell,  1991).  CKRL  offers  a  language  in 
which  knowledge  can  be  exchanged  between 
machine  learning  tools  and  it  uses  the  set  of 
most  common  representation  structures  and 
operators.  While  CKRL’s  representation  for 
multistrategy  learning  seeks  to  integrate  the 
various  representations  employed  by  several 
different learning programs for communication
of knowledge between the machine
learning  tools,  our  aim  is  to  develop  a 
representation  that  facilitates  an  integration  of 
learning  and  inference  processes. 

Semantic  network  knowledge  representation 
systems,  such  as  the  KL-ONE  family 
(Brachman et al., 1991), utilize a large network
of relationships between concepts,
intermixing different relationships. The
hierarchies they use are tangled, that is, a
concept can have more than one parent. As a
consequence, implementing knowledge transmutations,
e.g., generalization, is not as easy
as in DIH. DIH facilitates such transmutations
because it uses only single-parent hierarchies,
each representing a structuring of a set of entities
from a certain viewpoint. In DIH, a concept
can belong to different hierarchies, reflecting
the fact that a given concept (or object) can
usually be classified from several different
viewpoints.

The  design  of  semantic  networks  is  primarily 
oriented  toward  facilitating  deductive 
inference,  and  is  not  usually  concerned  with 
knowledge  visualization.  The  design  of  DIH  is 
oriented  toward  facilitating  multitype 
inference  and  providing  a  basis  for  the  visual 
presentation  of  knowledge.  DIH  also  utilizes  a 
hierarchy  of  merit  parameters  to  represent 
probabilistic  factors  associated  with  plausible 
reasoning. 

3.  Basic  Components  of  DIH 

The  theory  of  plausible  reasoning  postulates 
that  there  are  recurring  patterns  of  human 
plausible  inference.  To  adequately  represent 
these  patterns,  one  needs  a  proper  knowledge 
representation. The DIH approach partitions
knowledge into a “static” part and a “dynamic”
part. The static part represents knowledge that
is relatively stable (such as established
hierarchies of concepts), and the dynamic part
represents knowledge that changes
relatively frequently (such as statements
representing new observations or results of
reasoning). The static part is organized into
type hierarchies (TH), part hierarchies (PH)
and precedence hierarchies. Precedence
hierarchies include several subclasses, specifically,
measure hierarchies (MH), quantification
hierarchies (QH) and schema hierarchies
(SH). The dynamic part consists of
traces that represent knowledge involving
concepts from different hierarchies. Each trace
links nodes of two or more hierarchies and is
assigned a degree of belief.

These hierarchies are composed of nodes
representing abstract or physical entities, and
links representing certain basic relationships
among the entities, such as “type-of”, “part-of”
or “precedes”. In the “pure” form, these
hierarchies are single parent, that is, no node
can have more than one parent. The root node
is assigned the name of the class of entities
that are organized into the hierarchy from a
given viewpoint.
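
The single-parent constraint can be illustrated with a minimal sketch. The class name `Hierarchy`, its methods, and the sample animal nodes are assumptions made for illustration; they are not part of the DIH implementation described in this paper.

```python
class Hierarchy:
    """A single-parent hierarchy: every node except the root has exactly one parent."""

    def __init__(self, kind, root):
        self.kind = kind              # e.g. "TH" (type) or "PH" (part)
        self.root = root              # names the class of entities being organized
        self.parent = {root: None}    # child -> parent links

    def add(self, node, parent):
        """Attach `node` under `parent`, refusing a second parent for any node."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        if node in self.parent:
            raise ValueError(f"{node} already has a parent in this hierarchy")
        self.parent[node] = parent

    def ancestors(self, node):
        """Return the chain of nodes from the parent of `node` up to the root."""
        chain = []
        current = self.parent[node]
        while current is not None:
            chain.append(current)
            current = self.parent[current]
        return chain


# Hypothetical type hierarchy of animals.
animals = Hierarchy("TH", "animal")
animals.add("mammal", "animal")
animals.add("bird", "animal")
animals.add("dog", "mammal")
```

Because each node has one parent, `animals.ancestors("dog")` yields a single upward path, which is what makes upward and downward node movements unambiguous. A concept that needs to be classified from another viewpoint would be placed in a second `Hierarchy` object rather than given a second parent.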

A type (or generalization) hierarchy organizes
concepts in a given class according to the
“type-of” relation (also called a “generalization”
or “kind-of” relation). For example,
different types of “animals” can be organized
into a “type” hierarchy.

A part hierarchy organizes entities according
to a “part-of” relationship. For example, the
world, viewed as a system of continents,
geographical regions, countries, etc., can be
organized into a part hierarchy. While properties
of a parent node in the type hierarchy are
inherited by children nodes, this does not
necessarily hold for a part hierarchy. There are
several different part relationships, which
include part-component, part-member, part-location
and part-substance (Winston, Chaffin
and Herrmann, 1987).

To  represent  relationships  among  elements  of 
ordered  or  partially  ordered  sets,  a  class  of 
precedence hierarchies is introduced. Hierarchies
in this class represent hierarchical
structures  of  concepts  ordered  according  to 
some  precedence  relation,  such  as  “A  precedes 
B”,  “A  is  greater  than  B”,  “A  has  higher  rank 
than  B”,  etc. 

There are several subclasses of precedence
hierarchies. One subclass is a measure
hierarchy, in which leaves stand for values of
some  physical  measurement,  for  example, 
weight,  length,  width,  etc.,  and  the  parent 
nodes  are  symbolic  labels  characterizing 
ranges  of  these  values,  such  as  “low”, 
“medium”,  “high”,  etc.  Figure  1  shows  a 
measure  hierarchy  of  possible  values  of 
people’s  height.  Dotted  lines  indicate  a 
continuity  of  values  between  nodes.  Arrows 
indicate  the  precedence  order  of  the  nodes. 
Another subclass is a belief
hierarchy,  in  which  nodes  represent  degrees  of 
an  agent’s  beliefs  in  some  knowledge 
represented  by  a  trace. 

Other  subclasses  of  precedence  hierarchies 
include  a  rank  hierarchy  and  a  quantification 
hierarchy.  A  rank  hierarchy  consists  of  values 
representing  the  “rank”  of  an  entity  in  some 
structure,  e.g.,  an  administrative  hierarchy  or 
military  hierarchy.  A  quantification  hierarchy 
consists  of  nodes  that  represent  different 
quantifiers for a set (an example is shown in
Figure 2). A quantification hierarchy that is
frequently  used  in  commonsense  reasoning 
includes  such  nodes  as  “one”,  “some” 
(corresponding  to  the  existential  quantifier), 
“most”,  and  “all”  (corresponding  to  the 
universal  quantifier). 
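
As a rough illustration, the quantification nodes named above can be held in an ordered sequence in which position encodes precedence. The list order, the function names, and the one-step movement helper are assumptions made for this sketch, not definitions taken from DIH.

```python
# Assumed precedence order, from the weakest quantifier to the strongest.
QUANTIFICATION = ["one", "some", "most", "all"]


def precedes(a, b, order=QUANTIFICATION):
    """True if node `a` comes before node `b` in the precedence order."""
    return order.index(a) < order.index(b)


def step(node, offset, order=QUANTIFICATION):
    """Move `offset` positions along the order (+1 is upward, -1 is downward)."""
    position = order.index(node) + offset
    if not 0 <= position < len(order):
        raise ValueError(f"cannot move {offset} step(s) from {node}")
    return order[position]
```

With this ordering, `step("some", 1)` moves from “some” to “most”, and `step("all", -1)` moves from “all” to “most”.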

Each  hierarchy  has  a  heading  that  specifies  its 
kind  (TH,  PH,  MH,  QH  or  SH)  and  the 
underlying  concept  (or  viewpoint)  used  for  the 
creation of the hierarchy. In addition, the type
and part hierarchies also have a top node that
in the type hierarchies stands for the class of
all entities in the hierarchical structure, and in
the part hierarchies for the complete object.

Schema  hierarchies  (or  schema)  are  structures 
that  indicate  which  hierarchies  are  connected 
in  order  to  express  multi-argument  concepts  or 
relationships.  For  example,  the  schema 
hierarchy for the concept of “physical-object”
can be <shape, size>. This states that an
attribute “shape” applies to any object that is a
“physical-object” (a node in the “physical-object”
hierarchy), and produces a shape
value, which is a node in the “shape”
hierarchy. The schema hierarchy for the
concept of “giving” may be <giver, receiver,
object, time>, which states that this concept
involves  an  agent  that  gives,  an  agent  that 
receives,  an  object  that  is  being  given,  and  the 
time  when  the  “giving”  occurs.  The  agents, 
object  and  time  are  elements  of  their 
respective  hierarchies. 
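
A schema can be pictured as an ordered tuple of slot names that is filled with values drawn from the corresponding hierarchies. The sketch below uses the two schemas mentioned above; the dictionary `SCHEMAS`, the function `bind`, and the sample values are hypothetical and serve only to illustrate the idea.

```python
# Schemas from the text, written as ordered tuples of slot names.
SCHEMAS = {
    "physical-object": ("shape", "size"),
    "giving": ("giver", "receiver", "object", "time"),
}


def bind(concept, values):
    """Pair each slot of the schema for `concept` with a value, in schema order."""
    slots = SCHEMAS[concept]
    if len(values) != len(slots):
        raise ValueError(f"{concept} expects {len(slots)} values, got {len(values)}")
    return dict(zip(slots, values))


# Hypothetical instance of the "giving" concept.
event = bind("giving", ["Ann", "Bob", "book", "Monday"])
```

In a fuller implementation each value would also be checked for membership in the hierarchy that its slot names.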

DIH  also  makes  a  distinction  between 
structural  and  parametric  knowledge.  The 
structural  knowledge  is  represented  by 
hierarchies  and  traces  that  link  nodes  of 
different  hierarchies.  Parametric  knowledge 
consists  of  numeric  quantities  characterizing 
structural  elements  of  knowledge.  In  DIH,  this 
knowledge  is  represented  via  precedence 
hierarchies  of  merit  parameters.  The  basic 
merit  parameter  is  a  belief  measure  that 
characterizes  the  “truth”  relationship  of  a 
given  component  of  knowledge  representation 
(a  trace),  as  estimated  by  the  reasoning  agent. 
Other  merit  parameters  include  the  forward 
and  backward  strength  of  a  dependency, 
frequency,  dominance,  etc.  (Collins  and 
Michalski,  1989;  Michalski,  1993).  In  this 
paper,  we  will  consider  only  one  merit 
parameter,  namely,  the  belief  measure. 

The  theory  of  human  plausible  reasoning 
(Collins and Michalski, 1989) postulates that
people rely primarily on the structural
knowledge,  and  resort  to  parametric 
knowledge  when  the  “structural”  reasoning 
does  not  produce  a  unique  result.  They  resist 
performing  uncertain  inferences  based  on  only 
parametric  knowledge,  and  they  are  not  good 
at  assigning  a  degree  of  certainty  to  a 
statement  based  only  on  the  combination  of 
the  certainties  of  its  constituents,  without 
taking  into  consideration  the  meaning  of  the 
whole  sentence.  A  reason  for  this  may  be  that 
there  does  not  exist  a  normative  model  for 
reasoning  under  uncertainty  that  is 
independent  of  the  structural  aspects  of 
knowledge,  i.e.,  its  meaning.  Plausible 
reasoning  about  a  problem  or  question 
typically  involves  both  structural  and 
parametric  knowledge  components. 

Nodes  of  a  hierarchy  are  elementary  units  of 
the  DIH  representation.  Each  node  represents 
some  real  or  abstract  entity — a  concept,  an 
object,  a  process,  etc.  A  given  entity  can  be  a 
node  in  multiple  hierarchies,  where  each 
hierarchy  structures  a  set  of  entities  from  a 
different  viewpoint.  The  relevant  viewpoint  is 
determined  by  the  context  of  the  discourse. 

As  mentioned  earlier,  the  basic  structures  in 
the  DIH  representation  are  hierarchies,  nodes, 
traces  and  schema.  Our  research  on  DIH 
demonstrates  that  these  structures  provide  a 
very  natural  environment  for  performing  basic 
types  of  inference  on  statements.  The 
subsequent  sections  show  how  these 
inferences  are  performed  using  the  DIH 
representation. 

4.  DIH  Traces 

To describe the DIH knowledge
representation, let us start by representing the
following statement: “It is certain that some
power plants in New York have mechanical
failures.” Figure 2 presents this statement as a
trace connecting nodes of five hierarchies:
“Process plants” and “Failure”, both type
hierarchies; “Quantification”, the quantification
hierarchy; “Location”, a part hierarchy;
and “Belief measure”, a measure hierarchy.

The trace representing the sentence consists of nodes linked by dotted lines. The arrows in the
trace indicate the argument (reference set) that is being described by the sentence. The
interpretation of the trace is given by schema hierarchy SH1 in Figure 3.

Figure 2: A DIH trace representing the sentence “It is certain that some power plants in
New York have mechanical failures.”

The interpretation of the trace is done on the
basis of the schema hierarchy shown in Figure
3. The schema defines the universe of
sentences  that  can  be  generated  using  concepts 
of  these  hierarchies,  ordered  according  to  the 
schema. 

The convention for the direction of arrows in a
trace is that they point from the nodes
denoting descriptive concepts to the argument
node that stands for the set (or individual)
being described, called a reference set. In this
example, the set being described is “Power
plant” in the hierarchy of Process Plants, thus
the node representing it is the argument node.
Other nodes linked by the trace represent
descriptive concepts for the argument node.
The belief measure takes values from a belief
hierarchy, and refers to the entire trace rather
than a single node, which is indicated by the
schema.

Figure 3: Schema hierarchy SH1.

Using  the  formalism  of  the  annotated 
predicate  logic  (Michalski,  1983),  this  trace 
can be interpreted as: “(Some)x, [type(x) =
Power plant] & [location(x) = New York] &
[failure(x) = mechanical]: Belief = 1.0.” This
statement  is  a  quantified  conjunction  of 
several  elementary  statements.  An  elementary 
statement  expresses  one  property  of  the 
reference  node  (set),  for  example, 
“Location(Power  plant)  =  New  York.” 

In a formal expression of an elementary
statement, the reference set (“Power plant”) is
called an argument, the predicate (“Location”)
is called a descriptor, and the value of the
descriptor (“New York”) is called the referent.
Thus, an elementary statement is formally
expressed in the form “descriptor(argument) =
referent”.
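
The correspondence between a trace and its annotated predicate logic reading can be sketched as a small data structure. The class names `Elementary` and `Trace` and the method `render` are illustrative assumptions; the rendered string follows the form of the example statement given earlier in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Elementary:
    """One elementary statement: descriptor(argument) = referent."""
    descriptor: str
    referent: str


@dataclass(frozen=True)
class Trace:
    """A quantified conjunction of elementary statements with a belief value."""
    quantifier: str
    argument: str        # the reference set
    statements: tuple    # Elementary items describing the argument
    belief: float

    def render(self):
        """Write the trace in the annotated predicate logic form used in the text."""
        parts = [f"[type(x) = {self.argument}]"]
        parts += [f"[{s.descriptor}(x) = {s.referent}]" for s in self.statements]
        return f"({self.quantifier})x, " + " & ".join(parts) + f": Belief = {self.belief}"


trace = Trace(
    quantifier="Some",
    argument="Power plant",
    statements=(Elementary("location", "New York"), Elementary("failure", "mechanical")),
    belief=1.0,
)
```

Calling `trace.render()` reproduces the quantified conjunction for the power plant example.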

In  Figure  2,  the  square  boxes  contain  the 
heading  of  the  hierarchy.  The  concept 
specified  in  the  heading  is  the  general 
descriptor  for  the  hierarchy.  The  nodes  in  the 
hierarchy  are  possible  values  of  this 
descriptor. 

The schema hierarchy, SH1, in Figure 3 is
used for the interpretation of the trace
represented in Figure 2. The heading indicates
the type of hierarchy (SH: Schema Hierarchy)
and the reference set of the trace. Since the
schema hierarchy is a precedence hierarchy, a
valid interpretation of the schema requires
each of the descriptors in order. Thus the first
element of the trace must be from the
quantification hierarchy, the second from the
failure hierarchy, the third from the location
hierarchy and the last from the hierarchy of
belief measures. This schema hierarchy is also
utilized for the examples in Section 6.

Adding  knowledge  to  the  DIH  representation 
is  done  by  creating  hierarchies  and  specifying 
traces  that  express  statements  involving  nodes 
of  different  hierarchies.  To  allow  proper 
interpretation  of  a  trace,  the  schema  is  also 
specified  by  indicating  relevant  descriptors 
and  their  order. 
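
The ordering requirement imposed by a schema can be checked mechanically. The following sketch assumes that schema SH1 lists the four descriptors in the order given above; the node sets are small hypothetical fragments of each hierarchy, and the belief labels in particular are invented for illustration.

```python
# Descriptor order assumed for schema SH1, following the description in the text.
SH1 = ("Quantification", "Failure", "Location", "Belief measure")

# Hypothetical node sets; real DIH hierarchies would be far larger.
NODES = {
    "Quantification": {"one", "some", "most", "all"},
    "Failure": {"failure", "mechanical", "electrical"},
    "Location": {"world", "USA", "New York"},
    "Belief measure": {"certain", "likely", "unlikely"},
}


def is_valid(trace, schema=SH1, nodes=NODES):
    """True if each element of `trace` is a node of the hierarchy the schema names at that position."""
    if len(trace) != len(schema):
        return False
    return all(value in nodes[name] for name, value in zip(schema, trace))
```

For instance, `is_valid(("some", "mechanical", "New York", "certain"))` holds, whereas swapping the first two elements violates the schema order.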

DIH allows one to represent complex forms of
knowledge, involving different kinds of
quantifiers, multi-argument predicates, and
different types of logical operations on them,
and to associate degrees of belief with
individual statements. A more complete
description  of  the  DIH  representation  system 
is  given  in  (Hieb  &  Michalski,  1993). 

5.  Multitype  Inference  in  DIH 

The  core  theory  of  plausible  reasoning 
introduced  in  (Collins  &  Michalski,  1989) 
gives  four  knowledge  transmutation  operators 
(also called transforms): generalization,
specialization, similization and dissimilization.
The Inferential Theory of Learning
(Michalski,  1993)  specifies  several  additional 
operators,  of  which  abstraction  and  concretion 
are  incorporated  into  DIH.  (In  (Collins  and 
Michalski,  1989),  the  abstraction  and 
concretion  transmutations  were  called  referent 
generalization  and  referent  specialization, 
respectively.) 

Transmutation                  Symbol  Relevant Hierarchies    Inference Type
-----------------------------  ------  ----------------------  --------------
Argument Generalization        AGen    Type, Part              Deductive
Argument Specialization        ASpec   Type, Part              Inductive
Quantification Generalization  QGen    Quantification          Inductive
Quantification Specialization  QSpec   Quantification          Deductive
Abstraction                    Abs     Type, Part, Precedence  Deductive
Concretion                     Con     Type, Part, Precedence  Inductive
Argument Similization          ASim    Type, Part              Analogical
Argument Dissimilization       ADis    Type, Part              Analogical
Referent Similization          RSim    Type, Part, Precedence  Analogical
Referent Dissimilization       RDis    Type, Part, Precedence  Analogical

Table 1: Basic knowledge generation transmutations.


Generalization  (specialization)  transmutations 
extend  (contract)  the  reference  set.  They  are 
done  either  by  argument  generalization 
(specialization)  or  by  quantification 
generalization  (specialization).  Argument 
generalization  is  accomplished  by  moving 
above  the  node  representing  the  reference  set 
in a type hierarchy. Quantification generalization
is accomplished by moving up the
quantification hierarchy.

Abstraction  (concretion)  transmutations 
decrease  (increase)  the  amount  of  information 
about  the  reference  set.  A  way  to  accomplish 
such  a  transmutation  is  by  moving  above  the 
node  in  the  type  or  part  hierarchy  that 
corresponds  to  a  value  of  some  descriptor  in 
the  sentence  represented  by  the  trace. 

A similization (dissimilization) transmutation is
done  by  replacing  a  node  corresponding  to  the 
reference  set  (argument)  or  a  descriptor  value 
(referent)  by  a  node  at  the  same  level  of 
hierarchy,  which  corresponds  to  a  similar 
(dissimilar)  concept  within  the  context  of  the 
given  hierarchy.  In  the  case  of  dissimilization, 
the  resulting  trace  is  linked  with  a  negation 
node,  because  the  generated  inference  is  a 
negation  of  the  original  sentence  (Michalski, 
1993). 


These  transmutations  can  be  given  a  simple 
conceptual  interpretation,  if  one  assumes  that 
nodes  at  each  level  of  hierarchy  are  ordered  by 
the  relation  of  similarity,  that  is,  nodes  that 
correspond  to  similar  concepts  (in  the  context 
of  the  given  hierarchy)  are  located  near  each 
other,  and  nodes  that  correspond  to  dissimilar 
concepts  are  placed  far  away  from  each  other. 
Such  an  arrangement  is  natural  for  precedence 
hierarchies.  In  sum,  similization  and 
dissimilization  transmutations  are  performed 
by  sideways  node  movements,  while 
generalization (specialization) and abstraction
(concretion)  are  performed  by  upward 
(downward)  node  movements. 
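
The upward and sideways node movements can be sketched on a toy fragment of the “Process plants” hierarchy. In the code below the sibling node “Chemical plant”, the dictionary-based trace, and all function names are assumptions introduced for illustration; only “Power plant” and the hierarchy name come from the running example.

```python
# Hypothetical fragment of the "Process plants" type hierarchy (child -> parent).
PARENT = {
    "Process plant": None,
    "Power plant": "Process plant",
    "Chemical plant": "Process plant",  # assumed sibling, for illustration only
}


def move_up(node):
    """Upward movement: return the parent of `node`."""
    parent = PARENT[node]
    if parent is None:
        raise ValueError(f"{node} is the root and cannot be moved up")
    return parent


def move_sideways(node):
    """Sideways movement: list the nodes at the same level under the same parent."""
    return [n for n, p in PARENT.items()
            if p is not None and p == PARENT[node] and n != node]


def argument_generalization(trace):
    """AGen: replace the argument by its parent, leaving the rest of the trace intact."""
    return {**trace, "argument": move_up(trace["argument"])}


def argument_similization(trace, new_argument):
    """ASim: replace the argument by a node at the same level of the hierarchy."""
    if new_argument not in move_sideways(trace["argument"]):
        raise ValueError(f"{new_argument} is not at the same level as {trace['argument']}")
    return {**trace, "argument": new_argument}


original = {"quantifier": "some", "argument": "Power plant",
            "location": "New York", "failure": "mechanical"}
generalized = argument_generalization(original)
similar = argument_similization(original, "Chemical plant")
```

Each function returns a new trace and leaves the original untouched, mirroring the idea that a transmutation generates a new statement from a given one. The sketch ignores the similarity ordering among siblings and the negation node required for dissimilization.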

Table  1  lists  all  the  above  knowledge 
transmutations,  specifying  their  abbreviated 
name,  the  relevant  hierarchies,  and  the 
underlying  inference  type.  The  relevant 
hierarchies  are  the  kinds  of  hierarchies  for 
which  the  transmutations  are  valid.  The 
various  kinds  of  part  hierarchies  are  not 
shown,  but  are  distinguished  in  DIH. 
Additional  constraints  are  necessary  in  some 
kinds  of  part  hierarchies  to  maintain  the 
validity  of  the  transmutation. 

Figure  4  presents  a  schematic  diagram 
illustrating  how  knowledge  transmutations 
modify a trace. A dotted line represents a link
in a trace. An arrow means that the trace is
moving to a new node in the indicated
direction by performing the indicated
transmutation. The quantification transmutations
operate over the entire trace, rather than
on a single node, as do the transformations
involving the merit parameters.

One  form  of  generalization  transmutation 
moves  a  node  in  the  quantification  hierarchy 
upward; another form moves a node
(argument)  in  the  type  hierarchy  upward.  The 
"+"  indicates  a  strengthening  of  a  merit 
parameter,  or  the  movement  of  the  link  to  a 
node  that  is  "higher"  in  the  particular  merit 
parameter  measure  hierarchy.  The  "-" 
indicates  a  weakening  of  the  merit  parameter, 
or  the  movement  of  the  link  down  in  the 
hierarchy. 

Moving  a  node  in  a  trace  in  a  manner  that 
corresponds  to  a  deductive  inference  (Table  1) 
produces a new trace (statement) with the
same truth status as the original trace. In the
case  of  node  movement  that  corresponds  to 
inductive  or  analogical  inference,  the  smaller 
the  node  movement  (“perturbation"),  the  more 
plausible  the  resulting  inference. 

The Argument Generalization transmutation
represents a deductive inference. The
abstraction operation is also deductive. In
contrast, Argument Specialization,
Quantification Generalization and Concretion are
inductive, because they produce traces
(statements) that logically entail the original
traces (statements).
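
The node movements described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the small type hierarchy, the `Trace` fields and the function names are hypothetical and are not taken from the DIH implementation. The inference types recorded for argument generalization and specialization follow the text.

```python
from dataclasses import dataclass, replace

# Hypothetical type hierarchy (child -> parent); the node names are illustrative only.
PARENT = {
    "power plant": "plant",
    "chemical plant": "plant",
    "nuclear power plant": "power plant",
    "plant": "facility",
}

# Inference type of the argument-level moves, as stated in the text.
INFERENCE_TYPE = {"generalize": "deductive", "specialize": "inductive"}


@dataclass(frozen=True)
class Trace:
    """A much simplified statement: quantifier, argument node and referent node."""
    quantifier: str
    argument: str
    referent: str


def generalize(node):
    """Upward move: replace a node by its parent."""
    return PARENT[node]


def specialize(node):
    """Downward move: list the children that could replace the node."""
    return sorted(child for child, parent in PARENT.items() if parent == node)


def similize(node):
    """Sideways move: list the siblings that share the node's parent."""
    parent = PARENT[node]
    return sorted(c for c, p in PARENT.items() if p == parent and c != node)


original = Trace("some", "power plant", "mechanical failure")
general = replace(original, argument=generalize(original.argument))
similar = replace(original, argument=similize(original.argument)[0])
```

Here `similar` corresponds to the argument similization example ("Some chemical plants in New York have mechanical failures"), with the location omitted for brevity. Each transmutation leaves the hierarchy untouched and changes only the trace.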

The above transmutations can usually be done
in a number of different ways, by moving to
different  alternative  nodes.  The  plausibility  of 
the  generated  statements  depends  on 
additional  merit  parameters,  such  as 
dominance,  typicality,  multiplicity,  similarity, 
frequency,  etc.  (Collins  and  Michalski,  1989). 
These  issues  will  be  the  subject  of  future 
research. 


A   argument (the set being described; the reference set)
R   referent (value of the descriptor characterizing the argument)
D   descriptor (relationship characterizing the argument)
Q   quantification
MP  one or more of the merit parameters

Dotted line: a link in a trace
Arrow: moving a node in the direction of the arrow performs the indicated transmutation

Figure 4: Diagram of knowledge transmutations in DIH.


6.  Visualizing  DIH-Based  Inference 

This section illustrates several basic
transmutations through a series of
self-explanatory examples. These examples
involve the same original statement,
represented as a trace in Figure 2. Given the
original statement, these transmutations
generate new statements illustrated by DIH
traces in Figures 5 through 12.


Input to Transmutation
Output from Transmutation
Direction of Transmutation


The legend above is used for interpreting the
following figures. The input statement is the
same as that of Figure 2, without the belief
measure hierarchy. All of the examples are
interpreted according to the schema SHI shown
in Figure 3.

There are two referents in the input statement.
The  resulting  statements  (output)  show  the 
results  of  the  given  transmutation  assuming 
that  there  are  no  merit  parameters  that  assist  in 
the  specialization  or  concretion  and  that  the 
similization  operator  finds  a  single  “most 
similar”  node  using  the  descriptors  given.  The 
Background  Knowledge  (BK)  is  the  learner’s 
prior  knowledge  that  is  relevant  to  the  learning 
process. 

Figure 5: Inductive generalization based on quantification.

Input: Some power plants in New York have mechanical failures
BK: Indicated Hierarchies
Output: Some power plants in New York have component defects

Input: Some power plants in New York have mechanical failures
Indicated Hierarchies
Output: Some chemical plants in New York have mechanical failures

Figure 11: Argument similization transmutation.

7. MTL-DIH System

The research on DIH aims at developing a
representation that will facilitate all basic
inferential strategies and knowledge
transmutations to be implemented in the
multistrategy task-adaptive learning system
(MTL-DIH).

Although issues related to the implementation
of an MTL system are beyond the scope of
this paper, we will briefly outline the basic
ideas. We have been pursuing two approaches,
MTL-JT, which builds a plausible justification
tree to "understand" a user's input (Tecuci,
1993), and a second one, MTL-DIH, based on
DIH.

In the MTL-DIH approach, a learning strategy
is determined by analyzing the learning task.
This analysis relates the input information to
the learner’s background knowledge and the
learning goal. The input information to the
system is assumed to be given in the form of
logic statements. It can be concept examples,
concept descriptions, rule examples, rules or a
combination of the above. The system
re-represents the input as a trace, or set of traces.
Background knowledge is the part of the
learner’s prior knowledge that is relevant to
the input and the learning goal.

The learning goal specifies criteria
characterizing knowledge to be learned. There
are different kinds of learning goals, such as to
predict new information, to explain the input,
to classify a fact or concept instance, to create
an abstract description from an operational
one or conversely, to create a problem solution
or a plan. It is assumed that the learning goal
is determined by a teacher or by the control
module of the system.

The learning process involves determining the
type of relationship between the given input
and the background knowledge, and
performing a sequence of knowledge
transmutations, involving input and
background knowledge, to produce knowledge
satisfying the learning goal.

8. Summary and Future Research

The  DIH  knowledge  representation  presented 
serves  as  the  basis  for  implementing 
multistrategy  task-adaptive  learning.  It  builds 
upon  ideas  of  the  Inferential  Theory  of 
Learning  and  the  core  theory  of  plausible 
reasoning.  Although  it  is  closely  related  to  the 
semantic  network  representation,  it  represents 
a  significantly  different  approach,  and 
contains  many  new  ideas  that  make  it 
particularly  useful  for  representing  multitype 
inference.  These  include  the  idea  of  dividing 
the  knowledge  representation  into  a  static  part 
and  a  dynamic  part,  the  organization  of 
knowledge  in  which  basic  forms  of  inference 
can  be  performed  via  simple  trace 
perturbations,  and  the  introduction  of  various 
precedence  hierarchies,  such  as  the  schema 
hierarchy,  the  measure  hierarchy,  and  the 
quantification  hierarchy. 

The primary purpose of this paper was to
demonstrate how DIH supports several basic
knowledge generation transmutations,
specifically, generalization, specialization,
abstraction, concretion, similization and
dissimilization. The first version of DIH has been
implemented in Smalltalk, and used as a tool
for investigating the interactive display and
modification of traces in hierarchies. The
visual display of inference is particularly
useful in situations that involve traces
connecting only a few hierarchies (that is,
representing short sentences). To facilitate
knowledge visualization, the system has an
option to present traces with only a limited
number of neighboring nodes, rather than
connecting complete hierarchies.

In DIH, the more knowledge structures there
are in background knowledge, the easier it is
to assimilate new knowledge, or to plausibly
explain input statements. DIH is an efficient
representation, because most knowledge
modifications consist of forming or changing
traces, without affecting the established
hierarchies.

Many issues remain to be addressed in future
research. Among these issues are the
representation of more complex forms of
knowledge (mutual implications, various
types of dependencies, temporal and spatial
knowledge) and the development of methods
for determining the effect of merit parameters
on the reasoning process.

Abstract

When solving homework exercises, human students
often notice that the problem they are
about to solve is similar to an example. They
then deliberate over whether to refer to the
example or to solve the problem without looking at
the example. We present protocol analyses showing
that effective human learners prefer not to
use analogical problem solving for achieving the
base-level goals of the problem, although they do
use it occasionally for achieving meta-level goals,
such as checking solutions or resolving certain
kinds of impasses. On the other hand, ineffective
learners use analogical problem solving in
place of ordinary problem solving, and this
prevents them from discovering gaps in their domain
theory. An analysis of the task domain (college
physics) reveals a testable heuristic for when to
use analogy and when to avoid it. The heuristic
may be of use in guiding multistrategy learners.

Keywords:  incomplete  theories,  human  skill 
acquisition,  multistrategy  learning,  protocol 
analysis. 

1 When To Use Analogical Problem Solving?

When doing homework exercises, human learners
often notice that the problem they are about
to solve is similar to an example, then deliberate
over whether to refer to the example or to
solve the problem without its aid. As one of
our subjects said, “this looks very much like the
one I had in the examples. Okay. Should I just
go right to the problem, which I distinctly
remember? Or should I try to do it without looking
at the example?” A multistrategy machine
learning program could face the same decision.
The objective of this paper is to find out what
heuristics good human learners use for deciding
whether to do analogical problem solving, then
determine when those heuristics would be good
for a machine learning program to use.

Because we use protocol data, the only evidence
we have of analogical problem solving is
episodes where a person explicitly refers to an
example, typically by flipping pages in a textbook,
in order to expose the page on which the example
is printed. Thus, analogical problem solving,
in this paper, means the process of referring to a
written example rather than a mentally held one.
As will be seen later, nothing in our conclusions
relies on this restriction, so the results may apply
to analogies that refer to mental examples
(or cases?) as well as written ones.

The protocol data come from subjects learning
Newtonian physics. The subjects worked
with textbook physics problems and examples,
such as the one in Figure 1. The protocol data
were collected as part of a study by Chi, Bassok,
Lewis, Reimann & Glaser (1989). The subjects
were 9 college students selected to have similar
backgrounds (Chi & VanLehn, 1991). The subjects
first refreshed their mathematical knowledge
by studying the first 4 chapters of Halliday
& Resnick (1981), a popular physics textbook.
They then studied the expository part of chapter
5, which introduces the basic principles of Newtonian
mechanics, its history and some classic experiments.
Students were tested at this point and
had to re-study parts of the material that they
did not understand. After they had mastered the
mathematical prerequisites and the basic principles,
they studied 3 examples and solved 19
problems while talking aloud. They were allowed
to refer to the examples at any time while solving
problems, but they were not allowed to refer
to their own previous problem solutions. The 9
subjects’ protocols, which averaged 5 hours each,
are the raw data for the findings reported here.
They contain many instances of analogical problem
solving. The goal is to discover which ones
helped learning and which ones hurt it.

Problem: The figure on the left below shows a block of mass m kept at rest on a smooth plane,
inclined at an angle of 45 degrees with the horizontal, by means of a string attached to the
vertical wall. What are the magnitudes of the forces acting on the block?

Solution:

(1) We choose the block as the body.

(2) The forces acting on the block are shown in the free-body diagram on the right.

(3) Since the block is unaccelerated, we obtain T+N+mg=0.

(4) It is convenient to choose the x-axis of our reference frame to be along the incline and
the y-axis to be normal to the incline.

(5) With this choice of coordinates, only one force, mg, must be resolved into components.

(6) The two scalar equations obtained by resolving mg along the x- and y-axes are:

T - mg sin 45 = 0 and N - mg cos 45 = 0.

(7) From these equations, T and N can be obtained if m is given.

Figure 1: A physics example, with line numbers added
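
Line (7) of the example in Figure 1 leaves the final computation to the reader. The following minimal sketch carries it out from the two scalar equations of line (6). The value of g and the 2 kg mass are assumptions chosen for illustration; neither appears in the example.

```python
import math

G = 9.8  # gravitational acceleration in m/s^2 (assumed value)


def incline_forces(mass_kg, angle_deg=45.0):
    """Return (T, N) in newtons from T - mg sin(angle) = 0 and N - mg cos(angle) = 0."""
    weight = mass_kg * G
    angle = math.radians(angle_deg)
    tension = weight * math.sin(angle)
    normal = weight * math.cos(angle)
    return tension, normal


# The 2 kg mass is a made-up input; the example leaves m symbolic.
tension, normal = incline_forces(2.0)
```

Because the incline is at 45 degrees, the tension and the normal force come out equal in magnitude, each being mg divided by the square root of 2.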

We used a contrastive protocol analysis technique
pioneered by Chi et al. (1989). The basic
idea is to split the subjects into two groups,
effective learners and ineffective learners, then
determine what the effective learners did differently
from the ineffective learners. Because the
students were trained to have the same prerequisite
knowledge, the scores on their problem
solving reflect their learning rate during example
studying and problem solving. The 4 highest
scoring subjects constituted the effective learners
(called Good solvers by Chi et al.), and the 4
lowest scoring subjects constituted the ineffective
learners (called Poor solvers). The middle
subject’s protocol was not analyzed (until later:
see below).
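
The split itself is simple enough to state as code. The sketch below assumes that each subject has a single problem-solving score. The subject labels and the scores are invented for illustration and are not the study's data.

```python
def split_learners(scores, group_size=4):
    """Rank subjects by score and return (good, poor, middle) lists of subject labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    good = ranked[:group_size]               # highest scoring subjects
    poor = ranked[-group_size:]              # lowest scoring subjects
    middle = ranked[group_size:-group_size]  # left out of the contrast
    return good, poor, middle


# Made-up scores for nine hypothetical subjects.
raw_scores = [91, 55, 78, 40, 66, 83, 47, 72, 59]
scores = {f"S{i}": score for i, score in enumerate(raw_scores, start=1)}
good, poor, middle = split_learners(scores)
```

With 9 subjects and groups of 4, exactly one subject falls into `middle`, which matches the design described above.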

The next section presents a learning mechanism
and argues that it is the main source of
learning by subjects in this study. The argument
uses new protocol analyses as well as analyses
published earlier. With this as background,
the subsequent sections present the main result,
which is that effective learners use analogical
problem solving sparingly. A discussion section
speculates on why this policy was better for human
solvers in this experiment, and suggests conditions
under which this policy would be good for
any multi-strategy learner.

2  Gap  Filling 

Given that errors are used to determine when
learning was not effective, a direct way to uncover
dominant learning mechanisms is to examine
the subjects’ errors. If errors of a certain
type are much less common among Good solvers
than Poor solvers, we can assume that a learning
mechanism employed by the Good solvers
and not the Poor solvers is reducing those errors.
From the characteristics of such errors we
can infer the characteristics of the learning processes.
We classified errors into 5 types, which
are listed below:

• Inappropriate analogies. Sometimes subjects
fetched an example that was inappropriate
for the problem being solved. At
other times, subjects fetched appropriate
examples but applied them in inappropriate
ways. Both types of errors are classified
as inappropriate analogies.

• Gap errors. Subjects often lacked a piece
of physics knowledge, such as the fact that
the tension in a string is equal to the magnitude
of a tension force exerted by that
string. Sometimes errors would occur when
the subject reached an impasse caused by
their lack of knowledge, and used some ineffective
repair strategy (VanLehn, 1990) to
work around it. At other times, the gap
would cause an error (such as a missing minus
sign) without the subject ever becoming
aware of the gap.

• Schema selection errors. All subjects knew
several methods or schemas for solving
physics problems. One method was to
draw forces, generate equations and solve
the equations. Another method was simply
to generate equations that contained
the sought and/or known quantities without
considering what forces or other physical
quantities might be present. On some problems,
subjects chose the equation-chaining
schema instead of the force schema, and this
caused them to answer the problem incorrectly.

• Mathematical errors. A typical mathematical
error was to confuse sine and cosine, or
to drop a negative sign.

• Miscellaneous errors.

The error classification was done separately by
two coders, with an intercoder reliability of 82%.
Differences were reconciled by collaborative protocol
analysis.

Table 1: Mean errors per subject for each error category

Error type                   Good       Poor
Inappropriate analogies      1.00       2.25
Gap errors                 **0.25     **7.75
Schema selection             0.50       1.75
Math errors                  0.25       0.75
Miscellaneous errors         1.25       1.75
Totals                     **3.25    **14.25

Table 1 shows the average number of errors
of each type per subject. Although the Good
solvers had fewer errors than the Poor solvers
in every category, the difference was significant
only for gap errors (t(6) = 5.36, p < .01). Moreover,
the difference was quite large (3.8 standard
deviations), and accounts for most (68%) of the
difference in the total error rates of the Good
and Poor solvers.
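
The 68% figure can be checked directly from the means in Table 1. The short script below is only an arithmetic check. The dictionary layout and the variable names are ours, while the numbers are copied from the table.

```python
# Mean errors per subject, copied from Table 1 as (Good, Poor).
TABLE_1 = {
    "Inappropriate analogies": (1.00, 2.25),
    "Gap errors": (0.25, 7.75),
    "Schema selection": (0.50, 1.75),
    "Math errors": (0.25, 0.75),
    "Miscellaneous errors": (1.25, 1.75),
}

good_total = sum(good for good, _ in TABLE_1.values())
poor_total = sum(poor for _, poor in TABLE_1.values())

# Share of the Good/Poor difference in total errors that is due to gap errors.
gap_good, gap_poor = TABLE_1["Gap errors"]
gap_share = (gap_poor - gap_good) / (poor_total - good_total)

print(f"totals: {good_total:.2f} vs {poor_total:.2f}; gap share: {gap_share:.0%}")
```

The column sums reproduce the Totals row (3.25 and 14.25), and the gap-error difference of 7.50 out of a total difference of 11.00 gives the reported 68%.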

These results suggest that Good solvers were
more effective learners than Poor solvers because
they employed some kind of learning process that
filled in the gaps in their knowledge.^1 There are
many kinds of mechanisms in the literature that
can detect and rectify incomplete domain theories.
For handy reference, let us refer to the
process(es) that Good solvers use as gap filling
even though we do not know what it is. Table 1
suggests that gap filling is the main learning process
that differentiates effective from ineffective
learners in this study.

^1 An alternative explanation is that the Good solvers
never had the gaps because they learned the knowledge
before studying the examples. Analyses of pre-test data
(Chi et al., 1989), the instructional material (VanLehn,
Jones & Chi, 1991) and the subjects’ backgrounds (Chi
& VanLehn, 1991) fail to support this interpretation of
the data.

This suggestion is consistent with findings
from Chi et al.’s (1989) analysis of the same
data. They found that during example studying,
Good solvers tended to thoroughly explain the
examples to themselves, while the Poor solvers
tended to read them rather casually. Further
examination of the protocols suggested that
self-explanation consisted of actually rederiving the
lines of the solution (VanLehn, Jones & Chi,
1991). If the main learning process is gap filling,
then this method of studying the example
should cause the subjects to detect gaps in their
knowledge. If a piece of physics knowledge is
required for deriving a line of the example’s solution,
and students lack that knowledge, then
they will be unable to fully explain the line. The
resulting impasse might cause them to seek the
missing knowledge and fill their gap. Thus, the
gap-filling hypothesis is consistent with the finding
that Good solvers self-explain examples more
than Poor solvers. Moreover, it explains why
self-explanation causes better learning (VanLehn
& Jones, in press-b).

The computational sufficiency of gap-filling
has been tested by implementing a simulation of
human learning, called Cascade, that is based on a
particular gap filling mechanism and comparing
Cascade’s behavior to the protocols (VanLehn,
Jones & Chi, 1991; VanLehn & Jones, 1991; in
press-a). Gaps can cause Cascade to reach impasses
(i.e., be unable to achieve a goal) while
trying to solve problems or rederive examples.
When Cascade’s “official” domain knowledge is
insufficient to achieve a goal, it tries to apply
overly general knowledge that captures regularities
common to many types of scientific and
mathematical problem solving. For instance, one
overly general rule is that scientific concepts often
correspond roughly to common sense concepts.
Cascade uses this rule during a problem
where a block rests on a spring. Cascade lacks
the knowledge that a compressed spring exerts a
force on the objects at its ends, so it reaches an
impasse. The overly general rule applies, because
Cascade knows that springs push back when you
push on them. The overly general rule justifies
creating an instance of a scientific concept
(“force”) because it involves the same objects as
an instance of a lay concept (“push back”). As
a side-effect of the application of this overly general
knowledge, a new domain rule is proposed:
If a block rests on a spring, the spring exerts
a force on it. If this rule is used successfully
enough times, it becomes a full-fledged member
of the domain theory. In this fashion, Cascade
fills gaps in its domain knowledge.
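
The control flow of this kind of gap filling can be illustrated with a small sketch. It is not Cascade's implementation: the class, the method names and the promotion threshold are hypothetical (the text says only "enough times"), and rules are represented as plain strings.

```python
from dataclasses import dataclass, field

PROMOTION_THRESHOLD = 3  # hypothetical value; the text only says "enough times"


@dataclass
class GapFillingLearner:
    domain_rules: set = field(default_factory=set)
    tentative_rules: dict = field(default_factory=dict)  # rule -> successful uses

    def achieve(self, rule, overly_general_applies=False):
        """Try to achieve a goal that needs `rule`; report which knowledge was used."""
        if rule in self.domain_rules:
            return "domain rule"
        if rule in self.tentative_rules:
            self._record_success(rule)
            return "tentative rule"
        # Impasse: the official domain knowledge cannot achieve the goal.
        if overly_general_applies:
            # Side effect of using overly general knowledge: propose a new rule.
            self.tentative_rules[rule] = 1
            return "overly general rule"
        return "unresolved impasse"

    def _record_success(self, rule):
        """Count a successful use and promote the rule once it has been used enough."""
        self.tentative_rules[rule] += 1
        if self.tentative_rules[rule] >= PROMOTION_THRESHOLD:
            del self.tentative_rules[rule]
            self.domain_rules.add(rule)


learner = GapFillingLearner()
spring_rule = "if a block rests on a spring, the spring exerts a force on it"
history = [learner.achieve(spring_rule, overly_general_applies=True) for _ in range(4)]
```

In this run the first use goes through the overly general rule, the next two uses rely on the tentative rule, and by the fourth use the rule has become part of the domain theory.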

Cascade’s behavior compares well with both
aggregate findings (VanLehn, Jones & Chi, 1991)
and individual protocols (VanLehn & Jones,
1993). This establishes that with plausible
assumptions about subjects’ prior knowledge,
there is enough information present in the environment
to allow a gap filling process to learn
everything that the Good solvers learn, to do
so without implausibly large computations, and
to generate outward behavior that is similar to
the subjects’ behavior. In short, gap filling is a
computationally sufficient account for the Good
solvers’ learning.

None of the results show that gap filling is the
only learning process going on. There could be
others as well. However, Table 1 suggests that
gap filling is the most important learning process,
because it accounts for most of the difference
in the learning of the Good and Poor
solvers.

3 Avoiding Analogical Problem Solving

There is already some evidence that the Good
solvers avoid analogical problem solving. This
section reviews those findings, then tries to ascertain
whether this is just a correlation or
whether avoiding analogy actually causes more
effective learning.

Chi et al. (1989) counted episodes of analogical
problem solving during the first 3 problems.
They found that Good solvers used analogy only
2.7 times per problem, whereas the Poor solvers
used analogy 6.7 times per problem. Thus, the
Good solvers use analogical problem solving less
often than the Poor solvers. Chi et al. also found
that the Good solvers used analogy in a more focused
way. When the Good solvers referred to an
example, they tended to jump into the middle of
it and read only a few lines (1.6 lines per episode,
on average). The Poor solvers tended to start at
the beginning of the example and read the whole
thing or until they found something they could
use (13.0 lines per episode). This suggests that
Good solvers are basically solving the problem
on their own, but they occasionally use analogical
problem solving to get specific information
from the example. The Poor solvers, on the other
hand, seem to use analogical problem solving instead
of regular problem solving. These findings
indicate a correlation between effective learning
and avoidance of analogy, but it is not clear which
way the causality runs.

Our first hypothesis was that the Poor solvers
used more analogical problem solving because
they lacked domain knowledge so they had to
refer to the example if they were to make any
progress. Cascade embedded this hypothesis. It
did analogy (called transformational analogy in
earlier reports) only when it reached an impasse
(VanLehn, Chi & Jones, 1991; VanLehn & Jones,
in press-a). On this account, the Chi et al. correlation
is due to ineffective learning causing analogy.

However, when we fitted Cascade to individual
protocols, we found that we sometimes had
to force it to do analogy even though it had the
knowledge to do regular problem solving (VanLehn
& Jones, 1993). While simulating all 9 subjects,
Cascade used analogy 231 times, and 196
of these were caused by impasses while 35 (15%)
were caused by our intervention. If we believe
the modeling, then these 35 analogies were “optional”
in that the subjects did not have to do
them. They could have used their knowledge
of physics principles instead. In most of these
cases (30 of 35), the subjects copied the example’s
force diagram rather than generate their
own. Copying the force diagram was also frequent
among the 196 impasse-driven analogies.

Upon reflection, it occurred to us that maybe some
of these supposedly impasse-driven analogies
were not actually caused by trying to generate
forces, failing, and reaching an impasse. If they
had been caused by impasses, then one would
expect the analogy to yield new knowledge about
the missing force (or whatever the missing knowledge was),
thus filling the gap and allowing the person to
draw their own force diagram the next time it
was needed. We examined all 196 cases of analogy
and found no cases where this kind of learning
occurred. If a person had a gap that caused
an impasse-driven analogy, then they would use
analogy on every subsequent occasion (if any)
when that piece of knowledge was required. It
could be that what people learn from such an
impasse is that “analogy works here,” so they
continue to use it. However, it could also be
that our modeling was incorrect, and they never
had any such impasses for that gap. Instead,
when they got to certain sections of the problem
(typically, the force diagram), they would
use analogy without ever considering using their
domain knowledge. Perhaps some of those 196
cases of impasse-driven analogy were really optional
analogies. Indeed, two of the subjects never
tried to draw a force diagram on their own; they
always copied an example’s diagram.

While investigating the gap-filling hypothesis,
we discovered additional support for this conjecture.
According to the gap-filling hypothesis,
gaps in the textbook become gaps in the student’s
domain knowledge, which cause errors until
they are detected and remedied. In order to
check this story, we carefully analyzed the first 5
chapters of Halliday and Resnick (1981) and discovered
9 pieces of knowledge that are required
by the problems and are not in the text (VanLehn
& Jones, in preparation). Using Cascade,
for each of the 9 subjects, we located the places
in the protocols where the 9 pieces of knowledge
could appear if they were known, or cause errors
if they were unknown. For each of the 9
pieces of knowledge, we created a chart, such
as the one shown in Table 2, that summarizes
what happened at each possible occurrence of
the gap. The particular piece of knowledge referenced
by Table 2 is “Projecting a vector onto
the negative portion of an axis yields a negative
formula.” This piece of knowledge is relevant
5 times during example studying and 16 times
during problem solving. At each place, for each
subject, we classified the protocol fragment into
one of the categories shown below (the symbol
in parentheses corresponds to the code used in
Table 2).

• (E) The subject omitted use of the knowledge,
which resulted in an error.

• (O) The subject omitted use of the knowledge,
but no error occurred. For instance,
one sign error might compensate for another.

• (blank) During example studying, the subject
did not explain the part of the example
where this piece of knowledge would be
used. During problem solving, the subject
used analogical problem solving to avoid the
line of reasoning that would use the piece of
knowledge.

•  (U)  The  subject  used  the  piece  of  knowledge 
without  hesitation  or  other  signs  of  unusual 
processing. 

• (L) The subject seemed to learn the knowledge.
Episodes received this code if the subjects
expressed puzzlement or commented
on their lack of knowledge, but eventually
came up with the right action (e.g., writing
a negative sign). For instance, subject
P2 overlooked the first minus sign in the
first example, but on the second minus sign
she said, “Hmm, why is [it] minus? Uh
huh. . . . Because these axis are starting here
so this is minus.” She then went back to the
first minus sign and said, “How about the
X’s. It should also be a minus. Yah, that
was a minus.” Subject S102 paused after
seeing the second minus and said, “Negative
W —  It’s  because  it’s  going  in  a  negative 
direction it points. . .they give it a negative
value [if] it’s below the Y-axis. I mean the
X-axis.” Subject P1 (quoted at length in
VanLehn, Jones & Chi 1991), took several
garden paths before discovering the correct
rule for explaining the minus sign, which
is clear evidence of her lack of knowledge.
However, her verbal behavior at the time of
the discovery was just as brief and cryptic
as the verbal behavior of P2 and S102. Such
limited verbal evidence is typical of discovery
events in protocol data (VanLehn, 1991;
Siegler & Jenkins, 1989). They nonetheless
seem to reliably mark transitions in the subjects’
knowledge.

•  (R)  The  subjects’  verbal  behavior  indicates 
that they are learning the piece of knowledge,
but they used it at least once before.
We  believe  these  are  cases  of  relearning. 

•  (?)  Protocol  missing. 

The blanks in the problem solving part of Table 2
show that for this piece of knowledge,
many  gaps  are  not  detected  because  the  student 
used  analogical  problem  solving.  When  we  con¬ 
structed  similar  analyses  for  all  9  gaps  and  all  9 
subjects,  we  found  that  of  the  81  (=  9x9)  cases 
where  a  piece  of  knowledge  could  be  learned,  in 
44  cases  (54%)  the  subject  avoided  all  places 
where  the  gap  could  be  detected  (as  did  S109 
in  Table  2).  This  analysis  clearly  indicates  that 
analogical  problem  solving  is  thwarting  gap  fill¬ 
ing  by  avoiding  lines  of  reasoning  that  would 
cause  the  gap  to  be  detected. 

This finding makes intuitive sense. Most problems
can be solved by a 4-step process: select
some  objects  as  the  “bodies”  (line  1  of  Figure  1), 
draw  a  diagram  for  each  body  showing  the  forces 
acting  on  it  (lines  2  and  3  of  Figure  1),  produce 
a  set  of  equations  (line  6  of  Figure  1),  then  solve 
the  equations  for  the  sought  quantity  (omitted 
in  Figure  1).  The  last  step  cannot  usually  be  re¬ 
placed  by  analogy  because  the  problems  seldom 
seek  the  same  quantities.  However,  the  first  3 
steps  can  often  be  achieved  by  analogy.  The  stu¬ 
dent  can  find  an  analogous  problem  and  copy 
either  its  force  diagram,  its  equations  or  both. 
A  student  who  does  this  avoids  using  the  force 
laws  (which  generate  forces)  and  Newton’s  laws 
(which generate the equations). Missing physics
knowledge can remain undetected as long as one
uses analogy to copy force diagrams and equations.
To put it bluntly, analogical problem solving
often preserves ignorance.

Table 2: Places where the negative-projection rule could be used

We can now understand part of Chi et al.’s
finding about the use of analogy by Good and
Poor  solvers.  The  Poor  solvers  displayed  more 
episodes  of  analogical  problem  solving  and  read 
more  lines  during  each  episode  because  they  gen¬ 
erally  avoided  generating  their  own  forces  and 
equations  by  copying  them  from  the  examples. 
This  is  not  just  a  coincidence,  but  seems  to  have 
caused  them  to  learn  much  less  than  they  would 
otherwise.  On  the  other  hand,  the  Good  solvers 
generally  tried  to  generate  their  own  forces  and 
equations  and  only  referred  sporadically  and 
briefly  to  the  examples.  Thus,  they  could  still 
detect  their  gaps  and  remedy  them. 

4  Using  Analogy  Sparingly 

We  suspect  that  the  Good  solvers’  use  of  analogy 
does  more  than  just  allow  gap-filling  to  operate. 
It  may  actively  aid  gap-filling  by  helping  to  both 
detect  and  fill  gaps.  This  section  presents  a  few 
pieces  of  protocol  data  to  support  our  conjecture. 
However,  more  data  are  clearly  required. 

There  were  6  cases  in  the  protocols  where,  ac¬ 
cording  to  the  analysis  above,  knowledge  was 
learned  during  problem  solving.  During  4  of  the 
6  episodes,  the  subject  clearly  referred  to  an  ex¬ 
ample.  Although  there  are  only  4  cases,  they  are 
different  enough  that  they  begin  to  show  that 


analogy  can  assist  gap  filling  in  several  ways.  To 
illustrate  this,  each  of  the  4  cases  will  be  pre¬ 
sented. 

In  one  case,  subject  P2  reached  an  impasse 
and filled it with the aid of analogy. The subject
was given a problem where water in a tube supported
a block and kept it from falling. The subject
quickly recognized that this problem is similar
to an example where a block hung from a string
which  prevented  it  from  falling.  However,  she  did 
not  refer  to  the  example  but  instead  went  to  work 
on  the  problem  by  drawing  its  forces.  Eventually 
she  realized  that  she  needed  to  know  whether  the 
water  exerted  a  force  on  the  block.  (The  text¬ 
book  had  never  mentioned  pressure  forces,  and 
the subject never used the word “pressure,” so
apparently  she  lacked  knowledge  of  this  kind  of 
force.)  She  thought  that  there  might  be  a  force 
analogous  to  the  tension  force  in  the  string,  but 
she  wasn’t  sure.  At  this  point  she  referred  to 
the  example,  presumably  in  order  to  ascertain 
its  similarity  to  the  problem.  She  eventually  de¬ 
cided  that  it  was  okay  to  assume  a  force  anal¬ 
ogous  to  the  tension  force.  This  seems  to  be  a 
case  of  an  analogy  assisting  in  the  formulation  of 
a  new  physics  conjecture  that  both  filled  a  gap  in 
the  subject’s  knowledge  and  resolved  a  problem 
solving  impasse.  ACT*  (Anderson,  1990)  and 
other  theories  claim  that  analogy  is  often  used  to 
resolve  impasses  and  thereby  acquire  new  knowl¬ 
edge. 

In another case, subject S105 generated forces
for a problem without referring to any examples.
However, he failed to draw one force (the
surface  normal — a  notoriously  unintuitive  force). 
He  had  never  produced  that  force  in  earlier  prob¬ 
lem  solving,  nor  had  he  self-explained  the  line 
in  the  examples  that  was  intended  to  teach  it. 
Thus,  we  assume  he  had  a  knowledge  gap.  After 
drawing  forces,  the  subject  fetched  an  example, 
viewed  its  force  diagram,  said,  “That’s  the  force 
I  wasn’t  thinking  of,”  and  drew  a  normal  force 
on  his  diagram.  Thereafter,  the  subject  regu¬ 
larly  drew  normal  forces  on  his  diagrams.  In 
this  case,  analogy  was  used  to  check  a  step  in 
the  problem  solving,  and  that  revealed  a  knowl¬ 
edge  gap.  Whereas  the  preceding  case  illustrates 
how analogy can help in filling gaps, this case
illustrates how analogy can help in detecting gaps.
There  was  a  second  case  of  analogy-based  check¬ 
ing  causing  detection  of  a  gap,  but  it  will  not 
be  presented  here.  These  cases  are  consistent 
with  a  finding  of  Chi  et  al.  (1989),  who  classified 
analogical  episodes  as  either  reading,  checking  or 
copying.  Good  solvers  had  many  fewer  episodes 
of  reading  and  copying  than  Poor  solvers,  but 
they  actually  had  more  episodes  of  checking. 

In  the  last  case  of  learning  while  doing  analog¬ 
ical  problem  solving,  the  subject  was  engaged  in 
a mixture of analogical problem solving and
self-explanation. Subject S101 apparently did not
know  that  projection  onto  the  negative  part  of 
an  axis  yields  a  negative  formula.  He  said,  “I’m 
trying  to  figure  out  why  these  are  negative,”  re¬ 
ferring  to  two  negative  signs  in  an  example.  He 
was  trying  in  vain  to  self-explain  the  example.  A 
moment  later  he  gave  up,  saying  “Well  let’s  see 
if  I  can  just  push  these  in.”  He  started  adapting 
equations  from  the  example  while  complaining, 
“This  is  called  copying  too  much  from  the  book. 
I  hate  that.”  This  is  clearly  a  case  of  analogi¬ 
cal  problem  solving  of  the  worst  kind.  However, 
after  solving  the  equations  and  producing  a  neg¬ 
ative  formula,  he  said,  “So,  according  to  this,  my 
x-component  is  equal  to  -9,  which  means,  okay. 
That  makes  sense.  That  makes  sense.  One  of 
these has to always be negative, doesn’t it?”
Apparently, the subject figured out why there is a
negative  sign  (although  it  appears  to  be  based 
on  some  kind  of  symmetry  argument,  which  is 
an overly general line of mathematical reasoning,
but not the best one for learning this rule). A
bit later in the protocol, he failed to use his new
knowledge,  but  corrected  his  oversight  a  few  lines 
later  (we  coded  this  as  a  learning  event  in  Table 
2).  After  that,  he  used  the  rule  fairly  consis¬ 
tently.  This  rather  complicated  case  illustrates 
how  analogy  can  combine  with  self-explanation 
to  produce  learning  via  a  kind  of  justified  anal¬ 
ogy  (Kedar-Cabelli,  1985). 

Although S101 was one of the Good solvers,
he was the worst of the group (Chi & VanLehn,
1992).  He  self-explained  some  parts  of  the  ex¬ 
amples,  but  he  tended  to  ignore  the  details.  In 
particular,  he  glossed  over  the  equations  with 
the  negative  signs  in  them.  We  suspect  that  he 
would  have  learned  more  if  he  had  self-explained 
the  examples  more  carefully  as  he  studied  them. 
His  discovery  of  the  negative-sign  rule  seemed  to 
go  much  less  smoothly  than  the  discoveries  of 
the  subjects  who  were  self-explaining  during  ex¬ 
ample  studying  (quoted  earlier).  Although  more 
evidence  is  certainly  needed  before  drawing  a 
firm conclusion, it currently appears that
self-explanation during example studying might be
more effective than self-explanation in the
context of analogical problem solving.

We  believe  that  these  cases  are  just  a  few  of 
the  many  ways  that  analogy  can  combine  with 
self-explanations,  overly  general  rules,  and  other 
learning  techniques.  The  point  is,  however,  that 
these  uses  of  analogy  only  briefly  interrupt  reg¬ 
ular  problem  solving  in  order  to  achieve  spe¬ 
cific  meta-goals,  such  as  detecting  gaps  or  fill¬ 
ing  them.  Wholesale  analogy,  the  kind  used  by 
the  Poor  solvers,  avoids  detecting  gaps  and  thus 
tends  to  retard  learning.  Poor  solvers  use  anal¬ 
ogy  to  achieve  base-level  goals,  such  as  having  a 
force diagram or having a set of equations.
However, the distinction between “meta-goals” and
“base-level goals” is notoriously slippery, so the
next  section  tries  to  formulate  a  better  heuristic 
for  when  to  use  analogy. 

5  Discussion 

The  preceding  sections  showed  that  in  one  study, 
effective  human  learners  used  analogical  problem 


solving  sparingly.  With  a  little  bit  of  compu¬ 
tational  common  sense,  we  can  generalize  this 
result  and  formulate  a  heuristic  for  when  such 
sparse  analogical  problem  solving  should  be  ef¬ 
fective. 

There are four steps to solving a Newtonian
mechanics physics problem:

1.  Define  a  system.  One  must  (a)  select  an  ide¬ 
alization  of  the  physical  world  consisting  of 
idealized  bodies  that  have  idealized  relation¬ 
ships  to  other  objects  and  move  in  idealized 
trajectories,  and  (b)  decide  whether  to  base 
the  analysis  on  forces,  energies  or  momenta. 
In  Figure  1,  line  1  corresponds  to  part  a 
(albeit,  tersely),  and  part  b  is  missing  be¬ 
cause  the  text  has  only  introduced  one  type 
of  analysis  (forces)  at  this  point. 

2.  Explicate physics quantities. For each body
in  the  system,  one  notes  the  forces,  energies 
or  momenta  associated  with  that  body.  Of¬ 
ten,  a  diagram  is  drawn  to  help  one  remem¬ 
ber  them.  In  Figure  1,  this  occurs  during 
line  2. 

3.  Generate  equations.  Each  body  contributes 
some equations, the connections between
bodies  contribute  other  equations,  and  the 
problem’s  boundary  conditions  may  con¬ 
tribute  further  equations.  In  Figure  1,  the 
equations  are  produced  on  line  6. 

4.  Solve  the  equations  for  the  sought  quantity. 

Although logically distinct, these steps are often
intermingled in a solver’s work. Solvers have
no  trouble  learning  this  basic  procedure.  It  is 
often  printed  in  the  textbook.  It  is  a  specializa¬ 
tion  of  the  general  3-step  procedure  (define  a  sys¬ 
tem,  formulate  a  mathematical  model,  solve  it) 
that is used for all mathematical analysis problems,
from lowly arithmetic and algebraic word
problems to esoteric branches of science and
engineering (see any textbook on systems theory,
e.g., Shearer, Murphy & Richardson, 1971).

The  system  definition  step  is  quite  different 
from  the  others.  There  are  no  real  principles 
for  defining  a  system.  Sometimes  a  man  hold¬ 
ing  a  block  is  treated  as  one  body,  and  some¬ 


times  as  two.  Sometimes  a  chain  is  treated  as 
a  single  body,  sometimes  as  an  infinite  sequence 
of  infinitely  small  bodies,  and  sometimes  as  two 
bodies  (cf.  Larkin,  1983).  Most  textbooks  em¬ 
phasize  that  system  definition  is  more  of  an  art 
than  an  algorithm,  and  few  give  any  heuristics 
at  all  for  defining  systems.  Although  the  prob¬ 
lems  used  in  our  data  are  too  simple  to  reveal 
how  the  subjects  learn  about  system  definition, 
it  seems  quite  likely  that  one  way  that  system¬ 
defining  can  be  learned  is  by  analogical  problem 
solving  and  building  up  a  case  library  that  pairs 
problems  with  systems. 

However,  the  other  3  steps  are  governed  by 
well-known  principles,  such  as  the  force  laws  (for 
step  2),  Newton’s  laws  (for  step  3)  and  mathe¬ 
matical  transformations  (for  step  4).  Moreover, 
once  the  system  has  been  defined,  the  analysis  is 
completely  determined.  In  step  2  (explication  of 
physics  quantities),  one  produces  all  the  forces 
(or  energies  or  momenta)  acting  on  the  system’s 
bodies.  In  step  3  (equation  generation),  one  pro¬ 
duces  all  the  equations  implied  by  those  physics 
quantities.¹ In step 4, the equations are solved
mechanically.  The  point  here  is  that  in  physics 
and  many  other  mathematical  analysis  task  do¬ 
mains,  the  most  important  decisions  are  made 
during  system  definition,  and  the  rest  of  the  anal¬ 
ysis  follows  more  or  less  deterministically  from 
those  choices.  Although  analogical  problem  solv¬ 
ing  would  be  useful  for  learning  search  control, 
search  control  is  not  very  important  for  steps  2 
and  3. 

Principles  can  be  used  for  problem  solving,  but 
this  does  not  mean  that  they  necessarily  should 
be  used.  It  may  be  that  case-based  reasoning  is 
more  effective  and/or  more  efficient  than  rule- 
based  reasoning.  If  so,  analogical  problem  solv¬ 
ing  would  be  an  effective  way  to  master  such  a 
task  domain.  For  instance,  case-based  reason¬ 
ing  might  be  more  effective  than  principle-based 
reasoning  for  certain  design  tasks  (e.g.,  cook¬ 
ing),  in  which  case  a  student  might  be  better  off 
practicing  wholesale  analogical  problem  solving 

¹ Actually, there are choices to make during step 3
regarding how to rotate the coordinate axes or whether to
omit certain equations. These choices affect the difficulty
of step 4, but not the ultimate outcome.

rather than rule-based problem solving. However,
case-based or analogical problem solving in
mathematical  analysis  task  domains  often  pro¬ 
duces  only  near  transfer,  whereas  principle-based 
reasoning  can  produce  both  near  and  far  transfer 
(e.g.,  Reed,  1989;  Sweller  &  Cooper,  1985).  Al¬ 
though  these  studies  did  not  distinguish  transfer 
of  system  defining  knowledge  (presumably  case- 
based)  from  transfer  of  principles,  it  nonethe¬ 
less  seems  plausible  that  principle-based  reason¬ 
ing  is  more  effective  than  case-based  reasoning 
during  steps  2  and  3  simply  because  it  is  pos¬ 
sible  to  generate  problems,  such  as  the  ones 
in  Larkin  (1983),  that  are  quite  dissimilar  to 
textbook  problems,  thus  thwarting  or  at  least 
complicating  case-based  reasoning,  and  yet  are 
amenable  to  solution  by  first  principles.  In  short, 
it  seems  that  the  target  knowledge  for  steps  2 
and  3  should  be  principles,  and  not  cases  nor 
search  control. 

It  is  worth  recalling  that  our  top-level  goal  is 
to  determine  when  analogical  problem  solving  is 
advisable  for  effective  learning.  So  far,  we  have 
argued  that  analogy  should  be  used  during  the 
system  defining  step  and  that  principles  should 
be  used  instead  of  analogy  during  the  other  two 
steps.  The  next  step  in  the  argument  is  to  con¬ 
sider  how  principles  can  be  learned  during  steps 
2  and  3. 

We  can  safely  assume  that  the  learner  al¬ 
ready  knows  many  principles,  so  that  the  learn¬ 
ing  problem  is  to  detect  a  gap  (missing  principle) 
and  fill  it,  rather  than  to  learn  a  whole  batch 
of  principles  at  once.  In  order  to  detect  a  gap, 
one must use principles instead of analogies to
achieve goals, for otherwise the body of knowledge
containing the gap will not be referenced and
detecting the gap would be impossible. Principle-based
problem solving will uncover gaps that
cause  impasses,  but  not  all  gaps  cause  impasses. 
Thus,  it  is  a  good  idea  to  check  the  interme¬ 
diate  solutions  produced  by  principle-based  rea¬ 
soning,  because  early  detection  of  an  error  will 
facilitate  locating  the  gap  that  caused  it.  Anal¬ 
ogy  is  one  way  to  check  solutions.  By  consider¬ 
ing  the  nature  of  the  principle-based  reasoning, 
one  can  predict  when  solution-checking  is  espe¬ 
cially  important.  In  physics,  missing  knowledge 


of  physics  quantities  will  cause  step  2  to  pro¬ 
duce  too  few  forces,  energies,  etc.,  but  this  will 
not cause impasses until much later, if at all.
Consequently,  it  is  a  wise  idea  to  use  analogi¬ 
cal  problem  solving  to  check  the  results  of  step 
2  before  moving  on  to  step  3.  This  is  just  what 
subject S105 did in the case mentioned earlier.²
In general, when the goal is “generate all X that
you can think of,” where X in S105’s case is
“forces,” then gaps will not cause impasses so
it  is  especially  important  to  check  the  complete¬ 
ness  of  the  set  of  generated  Xs,  and  analogy 
is  one  way  to  do  that.  An  experienced  learner 
of  mathematical  analyses  may  know  this  heuris¬ 
tic.  There  are  probably  other  heuristics  about 
when  to  check  in  order  to  detect  errors  early  via 
analogy.  Several  subjects,  for  instance,  routinely 
checked their equations’ signs and trigonometric
functions  against  the  examples. 

Once  a  gap  is  detected,  analogy  is  certainly 
one  possible  way  to  fill  it,  but  it  is  not  easy  to 
predict  whether  one  should  use  analogy  or  some 
other  technique,  such  as  instantiating  an  overly 
general  rule  (VanLehn,  Jones  &  Chi,  1991)  or 
explanation  pattern  (Schank,  1986).  A  heuristic 
for  this  decision  would  be  hard  to  formulate. 

We have arrived finally at our goal, which is a set of
heuristics for deciding when to use analogical
problem  solving.  The  heuristics  are: 

1.  If  the  task  domain,  or  some  part  of  the  task 
domain  (e.g.,  steps  2  and  3),  has  principles, 
and  they  are  more  effective  knowledge  than 
cases,  and  they  require  little  search  control, 
then  the  target  knowledge  should  be  princi¬ 
ples. 

2.  If  the  target  knowledge  is  principles,  they 
should  be  acquired  by  gap  filling,  which  im¬ 
plies: 

(a)  Gap detection: Try to use principles
instead of analogies, as a gap may show
up as an impasse. Use analogy to
check the intermediate results derived
via principles, as this may uncover gaps
that do not produce impasses.

² In order to fill such gaps, Cascade 3 used analogy
essentially as a check of step 2, although the implementation
was rather baroque (VanLehn & Jones, 1991; in
press-a).

(b)  Gap filling: Analogical problem solving
is one possible technique for filling
gaps.  Others  should  be  considered  as 
well. 

3.  If  the  target  knowledge  is  not  principles 
but  cases  or  search  control,  then  analogical 
problem  solving  may  be  useful. 
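
The three heuristics above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the class `TaskPart`, its three boolean fields and the wording of the returned advice are hypothetical names introduced here, not part of the study or of any implemented learner.

```python
from dataclasses import dataclass


@dataclass
class TaskPart:
    """Hypothetical description of a task domain, or of one step of it."""
    has_principles: bool
    principles_beat_cases: bool
    needs_little_search_control: bool


def target_knowledge(part: TaskPart) -> str:
    """Heuristic 1: decide what kind of knowledge the learner should aim for."""
    if (part.has_principles
            and part.principles_beat_cases
            and part.needs_little_search_control):
        return "principles"
    return "cases or search control"


def advice(part: TaskPart, gap_detected: bool = False) -> str:
    """Heuristics 2 and 3: decide how analogy should be used."""
    if target_knowledge(part) != "principles":
        # Heuristic 3: cases or search control are the target knowledge.
        return "analogical problem solving may be useful"
    if gap_detected:
        # Heuristic 2(b): analogy is one of several gap-filling techniques.
        return "fill the gap; analogy is one option among several"
    # Heuristic 2(a): principles first, analogy only as a check.
    return "solve with principles; use analogy only to check intermediate results"


# Assumed encodings of two steps discussed in the text.
system_definition = TaskPart(False, False, False)   # step 1
force_generation = TaskPart(True, True, True)       # step 2

print(advice(system_definition))
print(advice(force_generation))
print(advice(force_generation, gap_detected=True))
```

The sketch makes explicit that analogy is never ruled out; what changes with the target knowledge is whether it replaces principle-based reasoning or only checks and repairs it.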

This  conclusion  was  suggested  by  human  data. 
There  is  good  evidence  that  the  Poor  solvers 
use  analogy  “wholesale,”  as  a  replacement  for 
principle-based solutions to steps 1, 2 and 3.
There is also good evidence that Good solvers
avoid analogical problem solving in general.
There  is  some  evidence,  albeit  only  a  few  pro¬ 
tocol  excerpts,  that  when  Good  solvers  do  use 
analogy,  it  is  used  as  an  aid  to  gap  filling.  In 
this  last  section,  we  have  reflected  on  the  human 
data  and,  supported  by  common  sense,  derived 
a heuristic for when analogical problem solving
should aid learning. This heuristic could have
several uses. (1) It helps explain why the Good
solvers  learned  more  than  the  Poor  solvers.  (2)  It 
is  a  prescription  for  effective  learning  that  could 
perhaps  be  taught  to  human  students.  (3)  It 
could be embedded in a multi-strategy learner.

Clearly,  all  these  implications  are  testable  in 
their  own  fashions.  A  good  next  step  in  the  re¬ 
search  would  be  to  build  a  multi-strategy  learner 
and  experimentally  confirm  that  the  heuristic 
does  increase  learning.  If  this  succeeds,  one  could 
try  teaching  human  students  this  heuristic,  as 
well as others discovered during the computational
experiments, and see if their learning
increases as predicted.

Abstract

An  analytical  approach  to  multistrategy 
learning  is  presented.  Each  single  strat¬ 
egy  system  corresponds  to  a  classifier,  and 
a  multistrategy  system  corresponds  to  a 
combined  classifier.  Therefore,  an  ana¬ 
lytical  model  of  classification  with  many 
different  classifiers  is  developed  to  pre¬ 
dict  the  classification  accuracy  of  the  com¬ 
bined  classifier.  The  necessary  conditions 
for  the  improvement  of  classification  ac¬ 
curacy  are  determined  within  the  model. 
The influence of mutual dependence of
classifiers is studied in the case of two
classifiers. It is shown that the dependence
does not affect the conditions under
which the improvement of classification
accuracy emerges. The model is also
verified for two different systems learning
from medical data. Finally, some special
cases are analysed. Our analysis shows
that the optimal number of classifiers in
combinations strongly depends on the domain
and the chosen learning algorithms.

Keywords:  knowledge  representation, 
machine  learning,  multiple  knowledge, 
multistrategy  learning 

1  Introduction 

In the case of noisy and incomplete learning
data, which are typical in real life, artificial
intelligence has successfully adopted several
techniques from other scientific disciplines,
e.g. statistics. For example,
through empirical measurements it has
been shown that statistical estimates are
inevitable when constructing or pruning a
single decision tree (Breiman et al., 1984;
Mingers, 1989; Holder, 1991). During
the last decade, the question whether to
use single-strategy or multistrategy (integrated)
learning has often been addressed
by many authors.
For example, extensive measurements by
many researchers in the field of empirical
learning have shown that combining
several classifiers yields better results
than tuning, no matter how fine, one single
classifier (e.g. Buntine, 1990; Clark
and Boswell, 1991; Gams, 1989). Basic
principles of this approach can be found
in the work of several researchers (Brazdil
et al., 1991; Handler et al., 1991; Michalski,
1987; Minsky, 1991; Tecucci, 1991).
In Bayesian analysis, it has been shown
that it is better, at least in general, to design
many classifiers and combine them on
the basis of probability estimates (Cheeseman,
1991). The results which support
this thesis have also been achieved in
several related areas, e.g. it has been
shown that, when overlapping many filtering
methods, the average rating significantly
improves with the growing number
of methods (Foltz and Dumais, 1992).

In this paper, single-strategy systems correspond
to single classifiers, and a multistrategy
system corresponds to a combined
classifier, the combination of $m$ single
classifiers. The presented work attempts
to enhance the grounding of multistrategy
learning. Namely, the proposed
models are expected to predict whether
the multistrategy approach enables
improvements in the given case or not. In
our approach, an $m^2$-parametric model is
used to estimate the classification accuracy
of a combination of $m$ classifiers and
to compare it to the most accurate among
them. In the special case of two classifiers,
the effect of mutual dependence is introduced
using a hybrid classifier. The comparison
of two classification methods on a
medical domain is performed and the results
are in good agreement with the predictions
of our model. For the analysis
of the behavior of more classifiers, our model
is simplified to depend on two parameters
only, so that we can study the influence
of the properties of a single classifier on the
performance of the combination.

2  Classification With m Classifiers

2.1  General analysis

In the following, examples are described
in attribute-value languages. Each example
belongs to exactly one class. Let
us denote $N$ attributes as $A_1, \dots, A_N$
and their values as $(V_{11}, \dots, V_{1M_1}), \dots,
(V_{N1}, \dots, V_{NM_N})$. The measurement
space is then defined as

$$X = \{V_{11}, \dots, V_{1M_1}\} \times \dots \times \{V_{N1}, \dots, V_{NM_N}\}. \tag{2.1}$$

Let $X$ be a measurement space and $C$ a
set of all possible classes. A domain is
then a set of ordered pairs

$$D \subset \{(x, c) : x \in X,\ c \in C\}, \tag{2.2}$$

and a classifier $d$ is a function

$$d : X \to C \times \Re, \qquad x \mapsto (d(x).c,\ d(x).cf).$$

Here, $\Re$ are real numbers, $c$ denotes the
class of object $x$ and $cf$ is a confidence
factor of classification.

Let us now consider a set of classifiers, all
mapping from the same $X$ to the same $C$,
$M = \{d_1, d_2, \dots, d_m\}$. There are different
ways of combining the classification of
these classifiers. One can take the result
of classification with the greatest confidence
factor (best-one principle). We can
also take the class that was proposed by
the majority of classifiers (majority principle).
In addition, this voting can also
be weighted. In this presentation, we are
dealing with the best-one principle. A
multiple or combined classifier on a set $M$,
based on the best-one principle, is therefore
defined as

$$d_M : X \to C \times \Re, \qquad x \mapsto (d_i(x).c,\ d_i(x).cf); \quad d_i(x).cf = \max_{d_j \in M} \{d_j(x).cf\}.$$
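
The combining principles just described can be illustrated with a short Python sketch. It is not the authors' implementation: the representation of a classifier as a function returning a `(class, confidence)` pair, the names `best_one` and `majority`, and the three toy classifiers are assumptions made here for illustration only.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A prediction is the pair (class c, confidence factor cf).
Prediction = Tuple[str, float]
Classifier = Callable[[tuple], Prediction]


def best_one(classifiers: Sequence[Classifier], x: tuple) -> Prediction:
    """Best-one principle: keep the prediction with the greatest cf."""
    return max((d(x) for d in classifiers), key=lambda pred: pred[1])


def majority(classifiers: Sequence[Classifier], x: tuple) -> str:
    """Majority principle: return the class proposed by most classifiers."""
    votes = Counter(d(x)[0] for d in classifiers)
    return votes.most_common(1)[0][0]


# Three toy classifiers (hypothetical) that ignore their input.
classifiers = [
    lambda x: ("sick", 0.9),
    lambda x: ("healthy", 0.6),
    lambda x: ("healthy", 0.7),
]
x = ("fever", "cough")

print(best_one(classifiers, x))
print(majority(classifiers, x))
```

On this toy input the two principles disagree, since the single most confident classifier is outvoted by the other two; this is why the choice of combining technique matters for the analysis that follows.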


When classifying a single vector $x \in X$
with classifiers from $M$, each of them
can predict either the correct (success, T)
or the wrong class (failure, F). The result
of the multiple classification can be described
by a situation vector $s \in S$, where
$S = \{T, F\}^m$ and $\|M\| = m$. In this notation,
we set

$$s_i = \begin{cases} T, & (x, d_i(x).c) \in D \\ F, & (x, d_i(x).c) \notin D. \end{cases}$$

Let us now determine the probability of
the correct classification of $m$ combined
independent classifiers. For each situation
$s \in S$, we have to determine the probability
$p_s$ of its occurrence and the probability
$q_s$ of the correct classification in a given
situation.

In a given domain $D$, the probability of
correct classification of the $i$-th classifier
is denoted by $p_i$. These probabilities play
a major role in our analysis. We will assume
$0 < p_i < 1$ to avoid trivial results. Under
the assumption that all the classifiers are
mutually independent, the probability of
the occurrence of the situation $s \in S$ is

$$p_s = \prod_{i,\ s_i = T} p_i \prod_{j,\ s_j = F} (1 - p_j). \tag{2.5}$$


The determination of $q_s$ is, however,
rather complicated and depends on the
combining technique. Since we use the
best-one principle, i.e. the result of multiple
classification is the class predicted by
the classifier with the greatest confidence
factor, we have to introduce parameters
$q_{ij}$ in our model to denote the probability
that the confidence factor of the $i$-th classifier
is greater than the confidence factor
of the $j$-th classifier in a situation where
the first one succeeds and the second one
fails

$$q_{ij} = P(d_i(x).cf > d_j(x).cf \mid s_i = T,\ s_j = F). \tag{2.6}$$

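Now we have to determine the probability
that at least one of the classifiers $d_i$, $s_i = T$,
has its confidence factor greater than
all the classifiers $d_j$, $s_j = F$

$$q_s = P\Big(\bigvee_{i,\ s_i = T}\ \bigwedge_{j,\ s_j = F} \big(d_i(x).cf > d_j(x).cf\big)\Big).$$

Let us first express the inner factor

$$q_{is} = P\Big(\bigwedge_{j,\ s_j = F} \big(d_i(x).cf > d_j(x).cf \mid s_i = T,\ s_j = F\big)\Big) = \prod_{j,\ s_j = F} q_{ij}, \tag{2.8}$$

where $q_{ij}$ are defined in (2.6). In a special
case where $s_j = T\ \forall j$, we define $q_{is} = 1$.
Let $n_s$ be the number of T elements in $s$.
Now we can write the general expression for $q_s$

$$q_s = \begin{cases} 1 - \prod_{i,\ s_i = T} (1 - q_{is}), & n_s > 0 \\ 0, & n_s = 0. \end{cases}$$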

The probability of correct classification of
a multiple classifier on a set $M$ is obtained
as a sum over all possible situations

$$p_M = \sum_{s \in S} p_s q_s. \tag{2.10}$$

This  sum  is  the  basis  of  further  analysis. 
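
As an illustration, the sum over situations can be evaluated numerically by enumerating all $2^m$ situation vectors. The Python sketch below assumes independent classifiers, the occurrence probability of (2.5), and a success probability of the form "one minus the product, over the correct classifiers, of the probability that they fail to out-rank every wrong classifier", which reproduces the entries given for two and three classifiers in Section 2.2. The function name `combined_accuracy` and the example numbers are hypothetical.

```python
from itertools import product
from math import prod


def combined_accuracy(p, q):
    """Accuracy of the best-one combination of m independent classifiers.

    p[i]    -- probability that classifier i is correct
    q[i][j] -- probability that cf of i exceeds cf of j, given that
               i is correct and j is wrong (diagonal entries are unused)
    """
    m = len(p)
    total = 0.0
    for s in product([True, False], repeat=m):      # all situations
        # Probability that this situation occurs, eq. (2.5).
        p_s = prod(p[i] if s[i] else 1.0 - p[i] for i in range(m))
        right = [i for i in range(m) if s[i]]
        wrong = [j for j in range(m) if not s[j]]
        if not right:
            q_s = 0.0                               # nobody is correct
        else:
            # Inner factor: classifier i out-ranks every wrong classifier.
            inner = [prod(q[i][j] for j in wrong) for i in right]
            q_s = 1.0 - prod(1.0 - v for v in inner)
        total += p_s * q_s
    return total


# Two classifiers with assumed parameters.
p = [0.8, 0.7]
q = [[0.0, 0.9],
     [0.6, 0.0]]
print(round(combined_accuracy(p, q), 4))
```

With these assumed values the combination reaches 0.86, compared with 0.8 for the better single classifier. The enumeration is exponential in $m$, so the sketch is meant for checking small cases rather than for large ensembles.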
2.2  Special  cases 

Let us now take a look at some special
cases. In the case of 2 independent classifiers,
the 4 possible situations are presented
in Table 1.

s         p_s                     q_s
(F, F)    (1 - p_1)(1 - p_2)      0
(T, F)    p_1 (1 - p_2)           q_12
(F, T)    p_2 (1 - p_1)           q_21
(T, T)    p_1 p_2                 1

Table 1: 4 possible situations for 2 independent classifiers


For the probability of a correct classification
of a multiple classifier we obtain

$$p_M = \sum_{s \in S} p_s q_s = p_1(1 - p_2)q_{12} + p_2(1 - p_1)q_{21} + p_1 p_2. \tag{2.11}$$
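
As a worked consequence of (2.11), offered here as a sketch rather than as part of the original derivation, one can ask when the combination is more accurate than $d_1$ alone. Subtracting $p_1$ from (2.11) and collecting terms gives

```latex
\begin{aligned}
p_M - p_1 &= p_1(1 - p_2)q_{12} + p_2(1 - p_1)q_{21} + p_1 p_2 - p_1 \\
          &= p_2(1 - p_1)q_{21} - p_1(1 - p_2)(1 - q_{12}),
\end{aligned}
```

so $p_M > p_1$ exactly when $p_2(1 - p_1)q_{21} > p_1(1 - p_2)(1 - q_{12})$: what is gained in situation (F, T) must outweigh what is lost in situation (T, F). The condition for $d_2$ follows by exchanging the indices. For the assumed values $p_1 = 0.8$, $p_2 = 0.7$, $q_{12} = 0.9$, $q_{21} = 0.6$, the gain is $0.084$, the loss is $0.024$, and $p_M = 0.86$.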

For a set with 3 independent classifiers,
we obtain 8 different situations, which are
shown in Table 2.

s            p_s                             q_s
(F, F, F)    (1 - p_1)(1 - p_2)(1 - p_3)     0
(T, F, F)    p_1 (1 - p_2)(1 - p_3)          q_12 q_13
(F, T, F)    p_2 (1 - p_1)(1 - p_3)          q_21 q_23
(F, F, T)    p_3 (1 - p_1)(1 - p_2)          q_31 q_32
(T, T, F)    p_1 p_2 (1 - p_3)               q_13 + q_23 - q_13 q_23
(T, F, T)    p_1 p_3 (1 - p_2)               q_12 + q_32 - q_12 q_32
(F, T, T)    p_2 p_3 (1 - p_1)               q_21 + q_31 - q_21 q_31
(T, T, T)    p_1 p_2 p_3                     1

Table 2: Overview of situations for 3 classifiers


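In our model, each set M of m classifiers on a given domain is
described by $m^2$ parameters, which can be arranged into an
$m \times m$ matrix

$$\begin{pmatrix} p_1 & q_{12} & \cdots & q_{1m} \\ q_{21} & p_2 & \cdots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{m1} & q_{m2} & \cdots & p_m \end{pmatrix}$$

The above data description has been used for all our analyses.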

2.3  Mutual  dependence  of  classifiers 

The assumption that classifiers are mutually independent is rarely met
when dealing with real-life domains. The question is how relevant the
results of our model are in the case of dependent classifiers. Let us
again assume that $d_1$ and $d_2$ are two independent classifiers and
their probabilities of correct classification are $p_1$ and $p_2$,
respectively. A hybrid classifier $d_2^*$ is then a function which
gives the same result as $d_1$ with probability d and the same result
as $d_2$ with probability 1 - d. Then we can form a multiple
classifier with $d_1$ and $d_2^*$. The results are presented in
Table 3.


s*       p_s*                              p_s*, d = 0           p_s*, d = 1   q_s*
(F, F)   (1 - p_1)((1 - d)(1 - p_2) + d)   (1 - p_1)(1 - p_2)    1 - p_1       0
(T, F)   p_1 (1 - d)(1 - p_2)              p_1 (1 - p_2)         0             q_12
(F, T)   (1 - p_1)(1 - d) p_2              (1 - p_1) p_2         0             q_21
(T, T)   p_1 ((1 - d) p_2 + d)             p_1 p_2               p_1           1

Table 3: 4 possible situations for 2 dependent classifiers


As a sum over all situations, we obtain

$$p_M^* = \sum_{s^* \in S} p_{s^*} q_{s^*} = (1 - d)\,p_1(1 - p_2)\,q_{12} + (1 - d)\,p_2(1 - p_1)\,q_{21} + (1 - d)\,p_1 p_2 + d\,p_1 = (1 - d)\,p_M + d\,p_1. \qquad (2.13)$$

For d = 0, the result is obviously the same as for two independent
classifiers. For d = 1, the classifiers are equal and the obtained
result is equal to the result of one classifier. For 0 < d < 1, the
classification accuracy lies between $p_1$ and $p_M$. If we assume,
without loss of generality, that $p_1 \geq p_2$, it holds that
$p_M > p_1 \Rightarrow p_M^* > p_1$.


Though the dependence shrinks the gain of classification accuracy, it
doesn't affect the conditions under which the gain is obtained. A
similar result can also be shown for the relative gain of
classification accuracy

$$r_M^* = \frac{p_M^* - p_1}{p_1} = \frac{(1 - d)\,p_M + d\,p_1 - p_1}{p_1} = (1 - d)\,\frac{p_M - p_1}{p_1} = (1 - d)\,r_M. \qquad (2.14)$$

It is obvious that the mutual dependence does not affect the sign of
$r_M$.
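
The shrinking effect of the dependence can be checked numerically by summing the four situations of Table 3 and comparing the result with the closed form $(1 - d)\,p_M + d\,p_1$. This is a small sketch with illustrative function names; the parameter values used to exercise it are arbitrary assumptions, not measurements from the paper.

```python
def p_m_independent(p1, p2, q12, q21):
    """Accuracy of the combination of two independent classifiers."""
    return p1 * (1 - p2) * q12 + p2 * (1 - p1) * q21 + p1 * p2


def p_m_dependent(p1, p2, q12, q21, d):
    """Sum of p_s* * q_s* over the four situations of Table 3."""
    situations = [
        ((1 - p1) * ((1 - d) * (1 - p2) + d), 0.0),  # (F, F)
        (p1 * (1 - d) * (1 - p2), q12),              # (T, F)
        ((1 - p1) * (1 - d) * p2, q21),              # (F, T)
        (p1 * ((1 - d) * p2 + d), 1.0),              # (T, T)
    ]
    return sum(p_s * q_s for p_s, q_s in situations)


def relative_gain(p_m, p1):
    """Relative gain of the combination over the first classifier."""
    return (p_m - p1) / p1
```

With any admissible parameters, `relative_gain(p_m_dependent(...), p1)` should equal `(1 - d)` times the relative gain of the independent combination, so the sign of the gain is the same in both cases.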


3 Measurements In Medical Domain

3.1 Description of data and classifiers

For our measurements, real-life data on patients and their coronary
disease diagnoses have been used. Every patient was described by 30
attributes and fell into one of the three possible classes. The set
contained 112 patients and was ten times randomly partitioned into a
training set (80 patients) and a testing set (32 patients). The two
chosen methods were naive Bayes using an m-estimate (m=2) for
probabilities as the first method and the k-th nearest neighbor (k=5)
as the second one. Neither of the methods belongs to the AI field, but
they are often used as references when comparing classification
accuracies of different AI methods. They both return the class
distribution for a given example. In our experiments, the major class
was returned as a result of the classification, and its probability as
a confidence factor.

3.2  Verification  of  the  model 

During our experiments, we have measured the classification
accuracies $p_i$, the probabilities of occurrence of all situations
$p_{s^*}$, as well as the probabilities of correct classification in
a given situation $q_{s^*}$. The results are presented in Table 4.


part.   p_1     p_2*    p(F,F)   p(T,F)   p(F,T)   p(T,T)   q(T,F)   q(F,T)
0       0.813   0.844   0.156    0.000    0.031    0.813    ?        0.000
1       0.875   0.813   0.125    0.063    0.000    0.813    1.000    ?
2       0.844   0.844   0.125    0.031    0.031    0.813    1.000    1.000
3       0.781   0.750   0.219    0.031    0.000    0.750    0.000    ?
4       0.875   0.844   0.094    0.063    0.031    0.813    1.000    0.000
5       0.813   0.844   0.156    0.000    0.031    0.813    ?        0.000
6       0.844   0.875   0.125    0.000    0.031    0.844    ?        0.000
7       0.875   0.906   0.094    0.000    0.031    0.875    ?        0.000
8       0.938   0.969   0.031    0.000    0.031    0.938    ?        1.000
9       0.844   0.844   0.094    0.063    0.063    0.781    1.000    0.000
mean    0.850   0.853   0.122    0.025    0.028    0.825    0.800    0.250

Table 4: Measured probabilities p_i, p_s* and q_s*


In case a certain situation $s^*$ doesn't occur at all, the
corresponding conditional probability cannot be meaningfully estimated
and is therefore denoted by "?". From Table 4 we can see that the
classification accuracies $p_1$ and $p_2^*$ differ on average by about
0.3%. Both classifiers are quite similar and they also use the same
data for learning. We can also see that both classifiers rarely
disagree ($p_{(T,F)} + p_{(F,T)} < 6\%$). This fact indicates a high
level of mutual dependence of the classifiers.
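
The quantities discussed here follow directly from the four measured situation probabilities of one partition. The sketch below is illustrative only: the function name and the convention of passing `None` for an unmeasurable conditional probability are assumptions made for this example.

```python
def summarize_partition(p_ff, p_tf, p_ft, p_tt, q_tf=None, q_ft=None):
    """Derive accuracies and combined accuracy from situation probabilities.

    p_xy is the measured probability of situation (x, y); q_tf and q_ft are
    the conditional probabilities of a correct combined answer, or None when
    the situation never occurred (its probability is then zero anyway).
    """
    total = p_ff + p_tf + p_ft + p_tt
    if abs(total - 1.0) > 0.005:  # allow for three-decimal rounding
        raise ValueError("situation probabilities must sum to 1")
    p1 = p_tf + p_tt              # first classifier is correct
    p2_star = p_ft + p_tt         # second (dependent) classifier is correct
    disagreement = p_tf + p_ft    # exactly one of the two is correct
    p_m = p_tt                    # both correct: the combination is correct
    if q_tf is not None:
        p_m += p_tf * q_tf
    if q_ft is not None:
        p_m += p_ft * q_ft
    return p1, p2_star, disagreement, p_m
```

Applied to the partition 2 row of Table 4, `summarize_partition(0.125, 0.031, 0.031, 0.813, 1.0, 1.0)` gives both single accuracies as 0.844, a disagreement rate of 0.062 and a combined accuracy of 0.875.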




In order to estimate the dependence coefficient d and the
classification accuracy of the virtual independent classifier $p_2$,
we fit the situation probabilities $p_{s^*}$ from Table 3 and the
accuracy of the hybrid classifier $p_2^* = d\,p_1 + (1 - d)\,p_2$ to
the above measurements. The resulting values of d and $p_2$ were then
used for evaluation of (2.13). The predictions of (2.13)
($p_M$(calc.)) are compared to the measured classification accuracy of
the combined classifier ($p_M$(meas.)) in Table 5.


part.   d       p_2     p_M(meas.)   p_M(calc.)
0       0.834   1.000   0.813        0.813
1       0.928   0.013   0.875        0.875
2       0.765   0.844   0.875        0.875
3       0.958   0.051   0.750        0.750
4       0.680   0.777   0.875        0.875
5       0.834   1.000   0.813        0.813
6       0.801   1.000   0.844        0.844
7       0.752   1.000   0.875        0.875
8       0.500   1.000   0.969        0.969
9       0.524   0.843   0.844        0.844
mean    0.784   0.864   0.853        0.853

Table 5: Dependence d and comparison of model and measurements


We can see that the dependence is very high (over 75% on average),
which doesn't promise a substantial gain of classification accuracy.
In two cases (partitions 1 and 3), the fitting procedure chose a very
small number for $p_2$ and tried to compensate for this with a larger
value for d. For other partitions, however, the obtained values of d
seem quite realistic.
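
The fitting procedure itself is not spelled out in the text. As a rough illustration of how d and $p_2$ are tied to the measurements, the two disagreement rows of Table 3 can be inverted in closed form: $p_{(T,F)}/p_1 = (1 - d)(1 - p_2)$ and $p_{(F,T)}/(1 - p_1) = (1 - d)\,p_2$. The sketch below uses this simple inversion, which is an assumption of this example and not necessarily the procedure used to produce Table 5.

```python
def estimate_dependence(p1, p_tf, p_ft):
    """Closed-form estimate of (d, p2) from the disagreement probabilities.

    Uses the (T, F) and (F, T) rows of Table 3:
        p_tf = p1 * (1 - d) * (1 - p2)
        p_ft = (1 - p1) * (1 - d) * p2
    Returns p2 = None when the classifiers never disagree, because p2 is
    then not identifiable.
    """
    if not 0.0 < p1 < 1.0:
        raise ValueError("p1 must lie strictly between 0 and 1")
    a = p_tf / p1            # equals (1 - d) * (1 - p2)
    b = p_ft / (1.0 - p1)    # equals (1 - d) * p2
    one_minus_d = a + b
    if one_minus_d == 0.0:
        return 1.0, None
    return 1.0 - one_minus_d, b / one_minus_d
```

For the partition 2 measurements in Table 4 ($p_1$ = 0.844, $p_{(T,F)}$ = $p_{(F,T)}$ = 0.031) this gives d close to 0.765 and $p_2$ = 0.844, in line with the corresponding row of Table 5. For partitions where one kind of disagreement never occurs, the inversion puts $p_2$ exactly at 0 or 1, whereas a numerical fit may settle on a nearby value.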


The gain of classification accuracy with respect to $p_1$ and $p_2^*$
is obtained only for partition 2 (compare Table 4 and Table 5). Why?
Let us assume that both classification accuracies are equal,
$p_1 = p_2 = p$. Under this assumption, equation (2.11) is

$$p_M = \sum_{s \in S} p_s q_s = p(1 - p)(q_{12} + q_{21}) + p^2.$$

The necessary condition for relative gain is then

$$\frac{p_M - p}{p} > 0$$
$$(1 - p)(q_{12} + q_{21}) + p - 1 > 0$$
$$(1 - p)(q_{12} + q_{21} - 1) > 0$$
$$q_{12} + q_{21} > 1.$$

Indeed, this condition is met only for partition 2. The correlation
between the correct classification and the confidence factor is
certainly a vital condition for the success of a combined classifier.
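
The condition can be illustrated with a few lines of code. This is a sketch under the equal-accuracy assumption made above; the function name is illustrative.

```python
def relative_gain_equal_accuracy(p, q12, q21):
    """Relative gain of a two-classifier combination when p1 = p2 = p."""
    p_m = p * (1 - p) * (q12 + q21) + p * p
    return (p_m - p) / p
```

The returned value equals $(1 - p)(q_{12} + q_{21} - 1)$, so for $0 < p < 1$ it is positive exactly when $q_{12} + q_{21} > 1$, zero when the sum is 1, and negative otherwise.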


A closer look at Table 4 explains the reason for the failure: the
first classifier (naive Bayes) seems to be too self-confident, as we
can see from the values of $q_{s^*}$. Even if it fails, its confidence
factor is greater than that of the second classifier (k-th nearest
neighbor). Therefore, the simple best-one mechanism doesn't behave
fairly and the failure of the first classifier easily supersedes the
success of the second one. It is important to notice that the gain of
classification accuracy can be expected only for more balanced values
of $q_{s^*}$.

4 A Simplified Model Of m Classifiers

For the analysis of the behavior of more classifiers, our model is
rather hard to use, since it describes a system of m classifiers with
$m^2$ parameters. Therefore we have to make a sound simplification to
study the influence of the properties of a single classifier on the
performance of the combination for m > 2. First, we assumed that all
classification accuracies $p_i$ are the same. We base this assumption
on the observation that in real-life domains (also in our case) the
accuracies of different classifiers do not differ very much. Second,
we also assumed that all the conditional probabilities of successful
multiple classification $q_{ij}$ are the same. This is definitely not
true in our case. However, we claim that one of the reasons for this
is the relatively small number of situations where the classifiers
disagree, so our results for $q_{ij}$ aren't significant.
Furthermore, as mentioned above, these parameters should be more
balanced to obtain a fair behavior of the best-one combining method.
And finally, it seems that, in situations of practical interest for
the use of multistrategy learning, these parameters would depend more
on the domain than on the classifying methods.

Under the above assumptions, equation (2.10) gives us the following
results for relative gains (p > 0):

$$m = 2:\quad r_M = (p - 1)(1 - 2q)$$

$$m = 3:\quad r_M = (p - 1)(1 + p - 6pq - 3q^2 + 6pq^2)$$

$$m = 4:\quad r_M = (p - 1)(1 + p + p^2 - 12p^2q - 12pq^2 + 24p^2q^2 - 4q^3 + 8pq^3 - 8p^2q^3 + 6pq^4 - 6p^2q^4)$$

$$m = 5:\quad r_M = (p - 1)(1 + p + p^2 + p^3 - 20p^3q - 30p^2q^2 + 60p^3q^2 - 20pq^3 + 40p^2q^3 - 40p^3q^3 - 5q^4 + 15pq^4 + 15p^2q^4 - 20p^3q^4 + 10pq^6 - 30p^2q^6 + 20p^3q^6)$$

Since we simplified our model to depend on two parameters only, the
results can be presented in 3D space. The above functions are shown
in Figure 1. A gain of classification accuracy can be expected only
where they are positive.

Figure 1: Dependence of $r_M$ on p and q

The line in Figure 1 indicates the border between the areas of
positive and negative gain. We can clearly see that an improvement of
classification accuracy cannot be expected for small values of q.
Also, the relative improvement is obviously greater for smaller values
of p.


It seems that the relative gain for q = 1 and very small
classification accuracy grows linearly with the number of classifiers
in combination m. Let us again examine (2.10). If q is set to 1, all
the $q_s$ except the first one are also equal to 1. In the limit
$p \to 0$, only the terms $p_s$ of the form $p + \ldots$ are of
interest. Since only the probabilities of situations with one T
component are of this form and there are m such situations (not
taking into account the situation without T components, where
$q_s = 0$), the result is

$$p_M = mp + \ldots,$$

as we expected.
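
The limit can also be checked numerically. With q = 1 the combination is correct whenever at least one classifier is correct, so $p_M = 1 - (1 - p)^m$; the short sketch below (with an illustrative function name) evaluates this expression.

```python
def p_multiple_q1(m, p):
    """Combined accuracy for q = 1: at least one of m classifiers is correct."""
    return 1 - (1 - p)**m
```

For small p the ratio `p_multiple_q1(m, p) / p` approaches m, which is the linear growth described above.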

5  Conclusions 

An analysis of multistrategy learning in the form of multiple
classification has been presented. An analytical model has been
developed to predict the classification accuracy of a combination of
different classifiers. The analytical results were compared to the
measurements of classification of two methods on a real-life domain.
The average error of the model prediction was under 1%, so it seems
that we can rely on it if we want to determine whether or not to use
multistrategy learning in a given situation.

The assumption of mutual independence of different classifiers is
rarely met in real life, since all the classifiers typically use the
same set of examples for learning. Therefore we introduced the
dependence parameter d in our model to analyse the influence of
mutual dependence on the classification accuracy of the multiple
classifier, at least in the case of two classifiers. We showed that
the dependence shrinks the expected gain (positive or negative) of
classification accuracy by the factor (1 - d). However, it does not
affect the conditions under which we expect the improvement of
classification accuracy. We also showed how to fit our model to the
measured probabilities to estimate the value of d for the given
classifiers.

Within the simplified model it has been shown that the gain of the
classification accuracy grows linearly with the number of combined
classifiers if their accuracies are small and the probabilities for
the success of the combination are high. However, when the accuracies
of single classifiers are high, the expected gain decreases and
blindly adding new classifiers into the combination doesn't seem to
be the right method.

Our work presents new indications that the current learning
techniques can be substantially improved by the use of multiple
knowledge. However, it also indicates that the improvement appears
only under certain conditions, which have to be kept in mind during
the development and implementation of new combining methods.
Abstract

In this paper we will discuss the significance of memory and
reflection in integrated learning architectures. The Massive Memory
Architecture, a uniform architecture based on episodic memory and
case-based reasoning, is described. Its reflective capabilities are
described and we put forth the hypothesis that learning methods are
inference methods with reflective capabilities, i.e. methods
requiring a self-model of the system. Self-models and method
implementation are based on conceptual, knowledge-level descriptions
of inference. We show how the MMA reflective capabilities can be used
for integrating learning and problem solving.

1  Introduction 

In this paper we will discuss the significance of memory and
reflection in integrated learning architectures. The role of memory
in learning systems is widely recognized, as by Michalski's
“equation”: learning = inference + memory. This equation is a summary
of the basic abilities a learning system has to have in order to be
able to learn: the reasoning ability and the memory storage and
retrieval ability. In the next section we will introduce the Massive
Memory Architecture (MMA), an integrated learning architecture based
on episodic memory that we are developing. We claim there is a second
crucial issue in learning architectures that is not so widely
recognized, namely that of reflection. In order to justify this
approach we just have to realize that a system that has to learn from
its own experience has to be able to inspect its own behavior (which
has to be represented and stored in its memory), analyze it, discover
what aspect is responsible for a failure (or a success, or a delay,
etc.) and decide how to transform itself (its knowledge, procedures
for decision, etc.) so that its future behavior is considered more
appropriate (or adapted). All this is fairly general, but the
important thing is the necessity for the system to be capable of
self-inspection and self-modification, i.e. the capability of
reflection.

Reflection capabilities have been an issue with a long tradition in
mathematics, philosophy, logic, linguistics and computer science.
These capabilities have been formally studied since Feferman's (1962)
work on reflection principles, and computational approaches have
existed since FOL (Weyhrauch 89) and 3-LISP (Smith 85). For our
purposes, reflection can be used to think of learning methods as a
kind of inference or reasoning able to introspect into the system's
representation of its behavior and modify the system structure and
future behavior in an appropriate way. Therefore we will consider
learning as an inference of reflective nature that uses a self-model
of the system. This reflective nature means that learning is a system
component able to self-inspect and self-modify the system itself. In
order to infer new decisions from the results and behavior of other
inference processes, those results and behavior have to be
represented and stored in the memory for the learning inference to be
able to work with them.

A third characteristic is the scheme of representation used for
memory and inference. We do not follow logic-oriented formalisms, but
the conceptual or “knowledge-level” frameworks, like KADS (Wielinga
92) or Commet (Steels 90), developed for analyzing and representing
complex reasoning in expert systems. In these frameworks reasoning is
represented in terms of tasks (or goals), methods that may achieve
them, the subtasks needed to realize methods, and the knowledge or
models used by those methods. As we will see, this approach allows us
(1) to analyze learning methods as a form of complex reasoning based
on task/method decomposition, (2) to represent and implement learning
methods in a uniform way using this task/method decomposition, and
(3) to integrate them in a uniform architecture using reflection
principles to relate learning with problem solving.

2  The  Massive  Memory  Architecture 

The MMA is an experimental framework for experience-based learning
and reasoning. It is based on memorisation of past episodes of
problem solving and on a default behavior that resorts to analogous
past cases (precedents) to solve new situations. This is a default
behavior in the sense that it is used when no concrete domain
knowledge is available. The analogical inference is modelled as an
inference pattern Retrieve/Select/Adapt. This pattern is reified into
an analogical inference method object, where different retrieve or
select methods can be used that are domain dependent. The fact that
inference methods are first-class objects means that inference
methods can be programmed also. Inference methods in MMA are methods
that follow a Retrieve/Select/Adapt pattern. Thus, an inference
method is a reification of the basic inference pattern of the
architecture. Analogical methods are inference methods that use a
Retrieve method based on similarity and may then have Select methods
that are also based on similarity or that use domain-based,
knowledge-intensive methods. Inheritance is also represented and
implemented by explicit inference methods that use a retrieve method
that follows a link (e.g. the type link, but other inheritance
methods are used, like the species link). Thus, analogy and
inheritance are integrated as patterns of search in memory.

Every episode of problem solving of MMA is represented and stored as
an episode in memory. This is the main point of the reification
process: create the objects that can be usable for learning and
improving future behavior. MMA records memories of successes and
failures of using methods for solving tasks. Since inference methods
are also methods, learning can be applied to the different types of
retrieve methods and selection methods used in the process of
searching and selecting sources of knowledge in memory.


Analogical methods are inference methods that follow a task
decomposition of Retrieve/Select/Adapt. Since different methods can
be used for these subtasks, multiple methods of case-based reasoning
can be integrated. The characteristic of analogical methods is that
the Retrieve method uses a similarity-based method. Select methods
can also be based on similarity or can be domain-based,
knowledge-intensive methods. All inference methods are such because
they are able to search for sources from which some knowledge may be
retrieved. The type of knowledge retrieved is either domain knowledge
(as methods) or experiential knowledge (situations of failure and
success). Experiential knowledge is used by MMA to bias the
preferences of future actions using precedent cases stored in past
episodes. The uniform nature of MMA (all representation is in the
form of slots in objects) supports learning at all decision points of
the system.


2.2 Elements of inference

Our purpose is then to reify the problem solving process into a
collection of abstract inference components. We call this
inference-level reflection. We have developed NOOS, a frame-based
language with reflective capabilities, to implement the Massive
Memory Architecture. In NOOS, those elements are tasks (or goals),
methods (or ways of achieving a goal), and theories. Therefore, all
problem solving in a domain will be by means of a task to be solved
and the methods that can be relevant to solve it. Moreover, if a
method does not directly solve a task, it may induce subtasks that
need to be solved. The NOOS approach is uniform, and this entails
that the problem solving process is also described in the system in
terms of tasks and methods. For instance, if there is no method
specified for solving a given task, the task of the problem solving
process is to find such a method; or if there is more than one method
that can possibly solve a task, a task of the problem solving process
is to choose among them. A way to do it is trying them out until one
works: that would be a method for such a task. A task is engaged when
there is a query asking for the filler of a slot, expressed as
(>> F of U) in NOOS syntax and F(U) in abstract syntax.

The third component of our framework for inference-level reflection
is theories. Knowledge cannot be simply represented as an unorganised
bag of axioms, lest the language lose its capability of manipulating
different theories. In NOOS, every theory is reified into an object
of the language. For instance, the person-theory object reifies the
theory of what we know about persons, and we can have another theory
of what we know is typically true for persons reified in the
typical-person-theory object. Certainly, that leaves open the issue
of how these two theories relate to each other, i.e. how to use and
manipulate them appropriately. In the uniform approach of NOOS, there
is an inference theory that specifies how these domain theories are
treated when trying to solve some problem about persons. Relations
among theories can be stated in the usual way, as in the assertion

Default-Theory(person-theory) => typical-person-theory

where we state that the typical-person-theory contains what we know
by default about persons. Moreover, an inference theory may say, e.g.
that in order to infer some proposition P(John) we should use a
default theory typical-person-theory only after not being able to
infer P(John) using the main theory person-theory, where the
Default-Theory relation is used to prefer one theory before the other
according to the current situation. This knowledge is contained in
the person-inference-theory object.

The basic inference process of NOOS follows the Retrieve/Select/Adapt
pattern. The other notion needed to explain the inference process is
that of impasses. When a query for a slot (>> father of John) is
evaluated a new task is started. Then either

(i) task “father(John)” has a method like
(>> husband mother of self), or

(ii) a no-method impasse occurs.

Case (i) is called spontaneous inference and occurs at the base
level. However, in (ii) the impasse causes NOOS to search at a
meta-level for possible methods to use. Impasses are handled by
metaobjects, that is to say MMA is an impasse-driven reflective
architecture. The architecture specifies which types of impasses can
appear, and which kind of metaobject will handle them. The no-method
impasse in a slot F(U) is handled by its metafunction. There the
applicable methods can be retrieved and selected (maybe trying them
out) and the solution is cached in the slot (see Fig. 2). Every
impasse is an opportunity for learning and the reification process
creates and stores the objects needed to represent the situation (so
that it can be useful in the future). In the slot example, the
information stored is the successful method and the methods tried
that failed. The inference can be more complex, e.g. maybe the
applicable methods for slot query (>> father of John) are unknown.
This is a new kind of impasse: the no-metafunction impasse, which is
handled by the metatheory of John that possesses inference methods
able to search, retrieve and select methods in other objects. The
point to notice is that the uniformity of NOOS treats all situations
in the same way. As in Soar, every impasse arises from lack of
knowledge: either because the system does not know what to do, or it
has several possibilities to act and has to decide among them. The
first type of impasse is handled by inference methods that know how
to retrieve sources of knowledge. Multiple possibilities are handled
by strategic cliches, objects that know about preferencing and
selecting among choices.
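
The control flow of a slot query with a no-method impasse can be sketched in a few lines. The code below is an illustration in Python, not NOOS: the class `SlotAccess`, the function `solve`, the exception `Impasse` and the example methods are all hypothetical names chosen for this sketch, and the selection step is reduced to trying candidates in order.

```python
class Impasse(Exception):
    """Raised when no candidate method solves the task."""


class SlotAccess:
    """Record of one slot query: what was asked, what worked, what failed."""

    def __init__(self, name, domain, method=None):
        self.name = name        # slot name, e.g. "father"
        self.domain = domain    # object the slot belongs to
        self.method = method    # base-level method, if one is known
        self.failed = []        # methods tried that failed
        self.referent = None    # cached solution


def solve(access, metafunction):
    """Answer a slot query, going to the meta-level on a no-method impasse."""
    if access.referent is not None:          # solution already cached
        return access.referent
    if access.method is not None:            # spontaneous inference
        candidates = [access.method]
    else:                                    # no-method impasse: Retrieve
        candidates = metafunction(access)
    for method in candidates:                # Select by trying methods out
        try:
            value = method(access.domain)
        except LookupError:
            access.failed.append(method)     # remember the failure
            continue
        access.method = method               # remember the success
        access.referent = value              # cache the solution in the slot
        return value
    raise Impasse(f"no method solves {access.name}")


# Hypothetical example data and methods.
john = {"mother": {"husband": "Peter"}}


def father_direct(person):
    return person["father"]                  # fails: no such slot filler


def husband_of_mother(person):
    return person["mother"]["husband"]


father_of_john = SlotAccess("father", john)
answer = solve(father_of_john, lambda access: [father_direct, husband_of_mother])
```

After the call, `father_of_john` holds the successful method, the failed method and the cached answer, which is the kind of information described above as being stored for later learning.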

3  Reflection  and  self-models 

We will first justify the claim that learning is a type of meta-level
inference, arguing that learning requires a self-model of the system.
Then, we will explain the processes of reification and reflection.
Self-models are required because of the integration of learning
methods with a problem solving system. In general, a learning method
has to have a model of what “successes” and “failures” are in the
architecture, and of other relevant concepts for learning (e.g. the
SOLE-ALTERNATIVE concept in EBL-PRODIGY). These concepts are part of
the learning self-model of the architecture. These models depend upon
what is needed by the method, i.e. they are different models for
different learning methods (this is called the “white-box
requirement” in (Carbonell 91), meaning that any ML method has to be
able to view and represent what it requires of the problem solver).
Moreover, the learning method needs to be able to effectively inspect
part of the structure and behavior (state) of the architecture, and
interpret that into its method-specific model. Therefore, learning
can be viewed as a type of meta-level inference. A meta-level
inference is a kind of inference able to inspect (to have a model of)
the base level, infer some new decision, and modify the base level in
such a way that it complies with that decision (Smith 85). The
Massive Memory Architecture has a model of the usage of methods for
solving tasks. The MMA self-model for case-based reasoning consists
of the decisions, successes and failures of methods that are
declaratively recorded in the system's memory of cases. This model is
reified in MMA, i.e. those concepts are computational objects
accessible to the system. This model allows MMA to retrieve methods
already used for similar problems and to control search by analogical
transfer of past decisions in similar situations to the current
problem. The meta-level issue is also implicit in ILT, the
inferential learning theory (Michalski 91). In ILT learning methods
are analysed as higher-level inference patterns, the results of which
are “knowledge transmutations”, i.e. the modification of the system's
knowledge as mandated by the inference performed by the learning
method.

3.1  Reification  and  reflection 

The reflection principles specify the relationship
between a theory T and its meta-theory MT. The
upward principles specify the reification process
that encodes some aspects of T into ground facts
of MT. That is to say, reification constructs a
particular model of T in the language used by MT.
The nature of reification and of the model
constructed is open, i.e. it depends on the purpose
for which the coding is made. In MMA we use a
knowledge-level model of task/method/theory
decomposition (explained in §2) as a meta-model of
the base-level inference. We follow a framework
similar to the Components of Expertise (Steels 90).
A similar approach is taken in (Akkermans 92),
where the meta-model is the KADS modelling
framework (Wielinga 92) for expert systems.
However, we do not follow them strictly, except in
the general idea of using goals, methods, and
theories as “elements of inference”. The
meta-theory MT contains knowledge that allows the
system to deduce how to extend this model by
deducing new facts about it. This deduction process
is called meta-level inference, and the content of
this theory is again specific to the purpose at
hand (the meta-theory is indeed no more than a
theory). Finally, downward principles specify the
reflection process that, given a new, extended
model of T, has to transform the theory T into a
new theory T’ that complies with that new model. A
more detailed explanation of the reflective
principles and of the semantics of NOOS can be
found in (Plaza 92a).

Figure 3. Reification constructs a model of theory T. Metalevel
inference deduces new facts or takes new decisions that extend or
modify this model using a meta-theory MT. Finally, reflection constructs
a new theory T’ that faithfully realises the extended model of T.
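
The reify / meta-level inference / reflect cycle described above can be sketched in Python. This is only an illustration under stated assumptions, not NOOS code: the dictionary representation of a theory and the function names `reify`, `meta_infer` and `reflect` are hypothetical choices made for this example.

```python
# Hypothetical sketch of the reify / meta-level inference / reflect cycle.
# A "theory" is modelled as a dict mapping a task name to a list of method names.

def reify(theory):
    """Upward step: encode aspects of theory T as ground facts of the meta-level."""
    return {("available", task, m) for task, methods in theory.items() for m in methods}

def meta_infer(model, failed):
    """Meta-level inference: extend the model with new facts (here, preferences)."""
    extended = set(model)
    for (_, task, method) in model:
        if (task, method) not in failed:
            extended.add(("preferred", task, method))
    return extended

def reflect(theory, model):
    """Downward step: build a new theory T' that complies with the extended model."""
    new_theory = {}
    for task, methods in theory.items():
        preferred = [m for m in methods if ("preferred", task, m) in model]
        rest = [m for m in methods if m not in preferred]
        new_theory[task] = preferred + rest  # preferred methods are tried first
    return new_theory

T = {"age": ["age-method-5", "age-method-3"]}
model = reify(T)
extended = meta_infer(model, failed={("age", "age-method-5")})
T_prime = reflect(T, extended)
```

In this sketch the base theory is never edited directly: all changes are decided on the reified model, and only the reflection step produces the new theory.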

3.2  Self-models  in  MMA

Our  hypothesis  is  that  different  types  of 
learning  methods  would  require  different  self- 
models  of  the  architecture.  The  current  im¬ 
plementation  of  MMA  has  a  model  of  the 
methods  used  for  each  task:  methods  that 
have  been  proposed  (by  an  inference  method), 
methods  that  have  been  tried  but  failed,  and 
the  method  that  has  succeeded.  This  informa¬ 
tion  is  stored  in  an  object  called  slot-access. 
In  the  following  we  will  use  quotes  “X”  to 
designate the reification of X.

Access-Name(“Age(John)”)  ->  “Age”
Domain(“Age(John)”)       ->  #<John>
Method(“Age(John)”)       ->  #<Method Age-method-3>
Failed(“Age(John)”)       ->  #<Method Age-method-5>
Referent(“Age(John)”)     ->  #<32-Years>

This self-model is used by inference methods to
retrieve and transfer the metafunction (containing
the available methods) from a task solved in a
precedent case to a task in the present problem,
and for inferring preferences over method selection
based on their success or failure in those
precedents. For instance, the MMA can obtain the
method that successfully computed the age of John
using this query:

(» method reify (» age of John))

» #<Method Age-method-3>
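
A minimal Python sketch of such a self-model follows. It is not the NOOS slot-access object itself: the class `SlotAccess`, its field names and the helper `successful_method` are assumptions chosen to mirror the five entries shown above, as far as they are legible in the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SlotAccess:
    """Hypothetical record of how one slot (task) of one object was solved."""
    access_name: str                 # slot name, e.g. "Age"
    domain: str                      # object the slot belongs to, e.g. "John"
    method: Optional[str] = None     # method that succeeded
    failed: List[str] = field(default_factory=list)  # methods tried without success
    referent: Optional[str] = None   # value finally computed

def successful_method(self_model, slot, obj):
    """Return the method that succeeded for slot(obj), or None if nothing is recorded."""
    access = self_model.get((slot, obj))
    return access.method if access else None

self_model = {
    ("Age", "John"): SlotAccess("Age", "John", method="Age-method-3",
                                failed=["Age-method-5"], referent="32-Years"),
}
```

Calling `successful_method(self_model, "Age", "John")` plays the role of the query above; a learning method that needs other aspects of the architecture would require further fields on the record.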

Other learning methods that we are incorporating
into MMA use this self-model but also require its
extension. This is as expected, because our
self-models hypothesis implies that every learning
method may need to know different aspects of the
architecture. We are then in a process where an
analysis of those learning methods elucidates which
aspects of NOOS that are hidden or internal to its
implementation are to be reified and made
accessible to the architecture.

4  A  diagnosis  example 

Let us show the dynamics of the MMA with a
simple example, the car does-not-start diagnosis.
johns-car gives the problem data, where the
complaint is that the car does not start and the
task is diagnosis.

(define johns-car
  (owner john)
  (complaint does-not-start)
  (gas-level full)
  (battery-voltage low-voltage))

We may have two ways to solve the problem: one is
the knowledge-based diagnosis using a generate and
test method, and the other is a precedent-based
method. The knowledge to diagnose cars is in
car-g&t-diagnosis, but it might be incomplete (e.g.
it cannot always generate all possible hypotheses).
For this reason the inference theory for johns-car
holds a second method based on analogy, to be used
when the first method fails. The preference to use
car-g&t-diagnosis first is based on the knowledge
that it is a stronger method than analogy. If the
MMA did not have this knowledge, it would choose
one of the two methods randomly, and after solving
several car diagnosis cases it could base its
preference on their successes and failures.
[Syntactic remark: new objects or slots are written
in bold. (define (foo bar) <body>) creates an
object (or a slot) of name bar with body <body> and
type foo. Anonymous objects are designated by their
relation to a named one, e.g. the “metatheory of
johns-car” is designated as
(meta reify of johns-car).]

(define (inference-theory
          (meta reify of johns-car))
  (contents car-diagnosis-analogy-method
            car-diagnosis-inheritance-method)
  (define (link-select select)
    (link (» reify of stronger-than))))

(define (theory (theory reify of johns-car))
  (type car-g&t-diagnosis))

(define (basic-analogy-method
          car-diagnosis-analogy-method))

(define (type-inheritance-method
          car-diagnosis-inheritance-method)
  (stronger-than car-diagnosis-analogy-method))
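
The selection policy described before this listing (use the stronger method first, fall back to the other on failure, and rely on past successes and failures when no explicit preference is known) can be sketched in Python. The functions `order_methods` and `solve`, the convention that a method signals failure by returning `None`, and the scoring by successes minus failures are all assumptions made for illustration; they are not taken from the NOOS implementation.

```python
import random

def order_methods(methods, stronger_than=None, history=None, rng=random):
    """Order candidate methods for a task.

    stronger_than: set of (a, b) pairs meaning "a is preferred over b".
    history: optional dict method -> (successes, failures) from earlier cases.
    """
    if stronger_than:
        # Count how many other candidates each method dominates.
        score = {m: sum((m, o) in stronger_than for o in methods) for m in methods}
    elif history:
        # No explicit knowledge: prefer the method with the better track record.
        score = {m: history.get(m, (0, 0))[0] - history.get(m, (0, 0))[1]
                 for m in methods}
    else:
        shuffled = list(methods)
        rng.shuffle(shuffled)        # no knowledge at all: arbitrary order
        return shuffled
    return sorted(methods, key=lambda m: -score[m])

def solve(task, methods, **kw):
    """Try methods in preference order; a method signals failure by returning None."""
    for name in order_methods(list(methods), **kw):
        result = methods[name](task)
        if result is not None:
            return name, result
    return None, None

methods = {
    "car-diagnosis-inheritance-method": lambda car: None,  # incomplete theory: fails
    "car-diagnosis-analogy-method": lambda car: "low-battery-malfunction",
}
prefs = {("car-diagnosis-inheritance-method", "car-diagnosis-analogy-method")}
used, diagnosis = solve("johns-car", methods, stronger_than=prefs)
```

Here the preferred method is tried first and fails, so the analogy method produces the answer, which is the behaviour the example walks through later in this section.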

The car-g&t-diagnosis holds the knowledge to
generate malfunction hypotheses from complaints and
to test those hypotheses by checking the facts
known about a specific car (e.g. the status of
battery and gas tank). It also holds some
preferences to choose among competing hypotheses.
Hypothesis selection is based on knowledge about
the estimated frequency of malfunctions,
interpreting the more-frequent-than relation as a
preference relation to select the current
hypothesis.

(define car-g&t-diagnosis
  (complaint )
  (define (conditional battery-low?)
    (condition
      (» low-voltage equal battery-voltage))
    (result true))
  (define (conditional no-gas?)
    (condition (» empty equal gas-level)))
  (define (generate-&-test diagnosis )
    (define (complaint-to-malfunction-map
              generate-hypotheses)
      (complaint (» complaint)))
    (define (select-hypothesis select)
      (link (» reify of more-frequent-than)))
    (define (test-malfunction test-hypothesis)
      (device (»))                 ; the car itself
      (malfunction (» select)))))  ; cur. hypoth.

The generate-&-test method is a strategic cliche
that retrieves its choice set by means of a
generate-hypothesis method. The precedent-select
method selects among the choice set using a method
that retrieves a precedent case and prefers the
hypotheses in the choice set according to their
result (successful/untried/failed) in that
precedent. The adapt process executes the
test-hypothesis task to elucidate whether the
current hypothesis is an adequate solution.


(define (strategic-cliche generate-&-test)
  (generate-hypotheses )
  (test-hypothesis )
  (contents (» generate-hypotheses))
  (define (precedent-select select))  ; select hyp.
  (adapt (» test-hypothesis)))        ; test curr. hyp.
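
For readers unfamiliar with the cliche, the control flow of generate-and-test can be sketched in Python. This is a generic illustration rather than a translation of the NOOS object: the function `generate_and_test`, the tables `PLAUSIBLE`, `TESTS` and `FREQUENCY_RANK`, and in particular the ranking of low-battery before no-gas are assumptions introduced for the example.

```python
def generate_and_test(problem, generate, select, test):
    """Generic generate-and-test: propose hypotheses, pick one, test it, repeat."""
    choice_set = list(generate(problem))
    tried = []
    while choice_set:
        hypothesis = select(choice_set)      # preference-based selection
        choice_set.remove(hypothesis)
        tried.append(hypothesis)
        if test(problem, hypothesis):
            return hypothesis, tried
    return None, tried                        # failure: no hypothesis passed its test

# Illustrative domain knowledge (names and values are assumptions).
PLAUSIBLE = {"does-not-start": ["no-gas-malfunction", "low-battery-malfunction"]}
TESTS = {
    "low-battery-malfunction": lambda car: car["battery-voltage"] == "low-voltage",
    "no-gas-malfunction": lambda car: car["gas-level"] == "empty",
}
FREQUENCY_RANK = {"low-battery-malfunction": 0, "no-gas-malfunction": 1}  # lower = more frequent

johns_car = {"complaint": "does-not-start", "gas-level": "full",
             "battery-voltage": "low-voltage"}

diagnosis, tried = generate_and_test(
    johns_car,
    generate=lambda car: PLAUSIBLE.get(car["complaint"], []),
    select=lambda hs: min(hs, key=FREQUENCY_RANK.get),
    test=lambda car, h: TESTS[h](car),
)
```

When the generator knows no hypothesis for a complaint, the function returns `None`, which corresponds to the failure that makes the architecture fall back on the precedent-based method.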

The generate-&-test method is a generic method,
and the select subtask uses a fairly general
method. More specialized generate-&-test methods
can be written with more focused selection. In
car-g&t-diagnosis we can see that select-hypothesis
first uses some knowledge-intensive criterion (like
the expected frequency of malfunctions), and if a
unique selection cannot be achieved then a second
method, precedent-select, is used to choose the
best hypothesis.

(define (sequential-cliche select-hypothesis )
  (link )
  (contents (define (link-preference)
              (link (» link)))
            (define (precedent-select ))))

The generate-hypothesis task uses the
complaint-to-malfunction-map method, which maps
each complaint in our theory to the known set of
possible malfunctions that may cause it. This form
of domain knowledge is very direct, and there could
be other methods that derive this mapping from a
causal model.

(define (strategic-cliche
          complaint-to-malfunction-map)
  (complaint )
  (contents (» plausible-hypotheses complaint)))

(define (complaint does-not-start-complaint )
  (plausible-hypotheses
    low-battery-malfunction
    no-gas-malfunction))

Each car malfunction defines the test that can be
used to verify that it actually occurs in a device,
a repair recommendation, and different
relationships known to hold among car malfunctions
(like their expected frequency, in the example
below).

(define (malfunction low-battery-malfunction)
  (test
    (» reify of (» low-voltage equal battery-voltage)))
  (repair recharge-car-battery)
  (more-probable-than starter-malfunction))

(define (conditional-method test-malfunction)
  (device )
  (malfunction )
  (condition (» (» test malfunction) device))
  (result (» malfunction)))

Now let us suppose that our theory about car
diagnosis is incomplete (e.g. not all mappings
from complaints to malfunctions are complete), and
that this is the case with johns-car. The failure
of car-g&t-diagnosis causes the failure of the
first inference method in the metatheory of
johns-car, and then the less preferred
precedent-based method is selected. Retrieval of
similar cases of car diagnosis can now retrieve
casuistry information given while describing
particular problem solving episodes. The peters-car
precedent case asserts a simple causal explanation
between the complaint does-not-start and the
diagnosis known for that problem. Since the
explanation of this case is in the form of a method
for diagnosis, car-diagnosis-analogy-method
retrieves and applies this method to johns-car to
see if that explanation also holds there.

(define peters-car
  (complaint does-not-start)
  (battery-voltage low-voltage)
  (define (causal-explanation diagnosis )
    (cause (» low-voltage equal battery-voltage))
    (effect low-battery-malfunction))
  (repair (» repair malfunction diagnosis)))
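
The transfer step can be illustrated with a short Python sketch. It assumes a deliberately simple retrieval criterion (same complaint) and a flat dictionary encoding of cases; the functions `retrieve_precedents` and `analogy_diagnosis` and the `explanation` field are hypothetical names, not part of NOOS.

```python
def retrieve_precedents(case_base, problem):
    """Return stored cases that share the complaint of the current problem."""
    return [c for c in case_base if c["complaint"] == problem["complaint"]]

def analogy_diagnosis(case_base, problem):
    """Transfer a precedent's explanation if its cause also holds in the problem."""
    for case in retrieve_precedents(case_base, problem):
        slot, value = case["explanation"]["cause"]
        if problem.get(slot) == value:
            return case["explanation"]["effect"], case["name"]
    return None, None

peters_car = {
    "name": "peters-car",
    "complaint": "does-not-start",
    "battery-voltage": "low-voltage",
    "explanation": {"cause": ("battery-voltage", "low-voltage"),
                    "effect": "low-battery-malfunction"},
}
johns_car = {"name": "johns-car", "complaint": "does-not-start",
             "gas-level": "full", "battery-voltage": "low-voltage"}

diagnosis, source = analogy_diagnosis([peters_car], johns_car)
```

The precedent is not copied blindly: its cause is re-checked against the facts of the new problem, which is the sense in which the explanation is applied to johns-car "to see if that explanation also holds there".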


5  Deliberate  learning 


The memorization of the episodes in problem
solving constitutes the spontaneous and ubiquitous
learning method used by MMA and required by MMA in
order to function. We are currently experimenting
with forms of deliberate learning integrated in
MMA. Deliberate learning consists of learning
methods (implemented as regular NOOS methods) that
have to be explicitly called to be executed
(typically, after finishing a task). The main
purpose of Plaza and Arcos (1993) is to show the
integration of memory and analogy through
reflection, so in this section we are just going to
sketch some of the essential features of deliberate
learning in MMA.

Figure 4. Task structure of a deliberate learning
method. Each deliberate learning method is defined
with concrete methods for those subtasks.


Deliberate learning methods have in common a
characteristic task decomposition, shown in Fig. 4.
The different deliberate learning methods are
defined as an Introspect/Construct/Incorporate
pattern. A specific method is obtained by filling
the task decomposition with specific methods for
each subtask. Different learning methods are
implemented with particular methods for (A)
selecting a training set (Introspect) from objects
activated while solving a task in NOOS, (B)
constructing from them a new object, and (C)
incorporating that object into the rest of the
objects in the memory in an appropriate way.
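
The Introspect/Construct/Incorporate decomposition can be sketched in Python as a higher-order function whose three arguments are the concrete subtask methods. The name `make_learning_method`, the episode format and the example instantiation (remembering which method succeeded for a task) are assumptions made for this illustration only.

```python
def make_learning_method(introspect, construct, incorporate):
    """Fill the three subtasks with concrete methods to obtain a learning method."""
    def learn(episode, memory):
        training_set = introspect(episode)       # (A) select objects activated by the task
        new_object = construct(training_set)     # (B) build a new object from them
        return incorporate(new_object, memory)   # (C) add it to memory
    return learn

# One hypothetical instantiation: remember which method succeeded for a task.
learn_preference = make_learning_method(
    introspect=lambda ep: [s for s in ep if s["outcome"] == "succeeded"],
    construct=lambda steps: {s["task"]: s["method"] for s in steps},
    incorporate=lambda obj, mem: {**mem, **obj},
)

episode = [
    {"task": "diagnosis", "method": "car-g&t-diagnosis", "outcome": "failed"},
    {"task": "diagnosis", "method": "car-diagnosis-analogy-method", "outcome": "succeeded"},
]
memory = learn_preference(episode, {})
```

A different learning method would reuse the same skeleton and swap in other functions for one or more of the three subtasks.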

6  Related  work  and  Discussion 

Our work on architectures is related to cognitive
architectures like SOAR (Newell 90), THEO
(Mitchell 91), and PRODIGY (Carbonell 91). At first
sight, the MMA language resembles THEO, since NOOS
is a frame language with caching, TMS, and
“available methods” for slots. However, THEO does
not provide a clear metaobject definition, does not
reason about preferences over methods, and does not
incorporate analogical reasoning or explicit
inference methods. At a deeper level MMA resembles
Soar in that MMA is a uniform, impasse-driven
architecture with a built-in learning method. The
differences are that spontaneous learning here is
episode memorization and that our “learning as
metalevel inference” hypothesis shapes another
approach to inference and learning by the use of
reification, self-models and the explicit
representation of inference methods. The
introspective use of meta-explanations in Meta-AQUA
(Ram et al 92) is also related to the MMA approach,
which exploits reflection for learning. Meta-AQUA
is not impasse-driven but proposes a mapping
between classes of situations and learning methods
that can improve the system. Meta-Router
(Stroulia 92) combines planning and case-based
reasoning in a task-decomposition framework and
defines a typology of errors and methods for
repair. Our current NOOS language is to be
considered a descendant of the languages RLL-1
(Greiner and Lenat 80) and KRS (van Marcke 87).

Related work on knowledge-level modelling of AI
systems includes the Commet (or components of
expertise) framework (Steels 90) and the KADS
methodology (Akkermans 92). Our approach is closer
to COMMET in that the ontology of models, tasks and
methods proposed by COMMET is related to MMA’s
ontology of theories, methods and tasks. However,
NOOS considers two layers: base-level domain
theories and methods, and meta-level inference
theories and methods, while the Commet approach is
not reflective and is only concerned with the
domain layer. The KADS methodology is much more
different, although a reflective framework has been
used to describe the KADS four-layer architecture.
However, neither Commet nor KADS has been used to
perform learning tasks, and in fact MMA is the
first attempt to apply knowledge-level analysis to
learning tasks and to develop a computational
architecture that embodies that approach.

We  have  developed  an  architecture  based  on 
spontaneous  learning  by  memorization  of 
episodes  and  inference  based  on  transfer  from 
precedents.  We  are  currently  extending  it  to 
integrate  other  learning  methods.  Representa¬ 
tion  of  learning  methods  and  inference  meth¬ 
ods  is  uniform  and  based  on  conceptual 
frameworks  previously  used  in  the  analysis 
and  design  of  expert  systems  and  for  knowledge
acquisition.  This  “elements  of  inference” 
model  allows  reflection  about  inference,  in¬ 
stead  of  previously  used  procedural  or  logic 
reflection.  Reflection  principles  have  been 
used  to  provide  a  computational  mechanism  to 
integrate  learning  and  inference  in  a  uniform 
representation.  Reflection  is  also  being  used  to 
model  in  a  principled  way  the  relationship 
between learning components and the
architecture as a whole.

Abstract

Problems facing multistrategy learning include
managing the complexity of multiple interacting
learning components and managing the cost of
implementing them. High-level approaches to these
two problems are presented. Analysis of the task
domain can generate constraints on how performance
and learning components interact. This analysis can
be verified with respect to human performance,
providing a standard against which to measure the
scope of learning in the system. The marginal cost
of adding new varieties of learning to a system can
be reduced by using sharable learning components,
like those embedded in general problem-solving
architectures. These approaches require a
meaningful definition of what constitutes learning
variety, in order that systems can be compared and
that system complexity and implementation cost can
be evaluated with respect to the scope of learning
that the system achieves. Two constituents of
learning scope, density and diversity, are
proposed.

Keywords:  Multistrategy  learning,  human
problem  solving,  task  analysis,  architectures 

1.  Introduction 

Multistrategy learning faces some hard problems.
One is that of getting multiple kinds of learning
in one system to coordinate and communicate
effectively. Various specific solutions have been
investigated, including hard-wiring control of the
components (Hammond, 1989; Pazzani, 1990) and
opening control to the user (Morik, 1991). A more
general methodological problem is to develop
constraints on system functionality that help
shrink the design space of possible solutions.
Emulating humans’ ability to perform and learn
flexibly is one approach to developing such
constraints.

A  second  problem  in  multistrategy  learning  is 
the  cost  of  implementing  multiple  kinds  of 
learning  in  a  given  system.  Cost  increases 
with  the  variety  of  learning  forms,  reflecting 
the  need  to  implement  additional  unique 
learning  algorithms.  Lowering  the  marginal 
cost  of  adding  new  kinds  of  learning  to  a 
system  will  be  an  enabling  factor  in  increasing 
the  learning  scope  of  systems.  Architectural 
learning  components  will  play  a  role  in 
reducing  this  marginal  cost.

Bearing  on  both  these  problems  is  that  the 
notion  of  a  variety  of  kinds  of  learning,  or 
learning  scope,  should  be  a  well-defined, 
measurable  quantity.  Some  means  of 
evaluation  is  necessary.  The  learning  scope  of 
a  system  consists  of  at  least  two  measurable 
parameters, density and diversity.

Section 2 defines density and diversity. Section 3
proposes that a useful taxonomy for measuring
diversity derives from an analysis of the domain at
hand. Section 4 gives a sample analysis that
generates such a taxonomy. Section 6 discusses the
role of sharable learning components in reducing
implementation cost. Section 7 draws together
implications of the discussion.

2.  A  Definition  of  Learning  Scope 

Learning scope has at least two constituents.
First, scope implies learning with respect to as
many of a system’s aspects of performance as
possible. This quality will be called density. A
measure for density is the proportion of components
in a system that are learning targets, where a
learning target is a system component (process or
data structure) whose behavior changes over time.
Density reflects the extent to which learning
pervades the system. One-hundred percent density
would solve the "wandering bottleneck" problem
(Mitchell, 1983), in which a performance bottleneck
arises wherever there is a component that is not a
target.

Learning  density  by  itself  fails  to  capture  at 
least  one  intuition  about  what  constitutes 
scope:  a  system  with  many  similar  learning 
targets  may  be  learning-dense,  but  in  a  way 
that  is  degenerate  in  terms  of  variety. 

Therefore,  a  second  constituent  of  scope  must 
measure  the  diversity  in  a  system’s  learning 
behavior.  Measuring  diversity  requires  an 
objective  standard,  which  should  be  external 
to  the  system  so  that  it  can  be  used  to  compare 
systems.  Such  a  standard  would  have  the 
character  of  a  taxonomy  that  categorizes  kinds 
of  learning.  Relative  to  such  a  taxonomy, 
diversity  is  the  extent  to  which  a  learning
system  covers  the  taxonomy’s  categories.
Diversity  complements  density,  providing  a 
means  of  evaluating  the  significance  of  having 
many  learning  targets. 
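
The two constituents can be made concrete with a small Python sketch. Density follows the definition given above (the proportion of components that are learning targets). For diversity the text only says "the extent to which a learning system covers the taxonomy's categories"; reading that as a simple proportion is an assumption of this sketch, as are the function names and the example component and category labels.

```python
def density(components, targets):
    """Proportion of system components that are learning targets."""
    components = set(components)
    return len(components & set(targets)) / len(components) if components else 0.0

def diversity(covered, taxonomy):
    """Proportion of taxonomy categories covered by the system's learning."""
    taxonomy = set(taxonomy)
    return len(taxonomy & set(covered)) / len(taxonomy) if taxonomy else 0.0

# Hypothetical system description, for illustration only.
components = ["heuristics", "case-memory", "indexing", "retrieval", "planner"]
targets = ["heuristics", "case-memory", "indexing", "retrieval"]
taxonomy = ["search-control", "case-based", "difference-detection", "strategy-selection"]
covered = ["search-control", "case-based"]

d = density(components, targets)       # 4 of 5 components learn
v = diversity(covered, taxonomy)       # 2 of 4 categories covered
```

A system can score high on `d` and low on `v` when its many targets fall into few categories, which is the degenerate case the second constituent is meant to expose.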


3.  Domain-Specific  Taxonomies 

A  taxonomy  for  measuring  learning  diversity 
can  be  derived  from  an  analysis  of  the 
system’s  domain.  This  measure  is  justified  by 
the  following  knowledge-level  perspective. 
Learning  components  generate  knowledge  for 
a  target.  A  target  is  a  body  of  performance 
knowledge  that  grows  with  the  output  of  some
learning  component.  Each  target  represents  a
different  body  of  knowledge,  so  targets 
provide  a  dimension  of  knowledge-level 
variability.  This  dimension  constitutes  a 
measure  of  learning  diversity.  Moreover,  it 
explicitly  reflects  the  kinds  of  knowledge 
processed  through  learning,  complementing 
domain-independent  taxonomies  like  those  of 
Langley  (1987)  and  Carbonell  et  al  (1983). 

A  domain-specific  taxonomy  needs  to  be 
constructed  anew  for  each  domain,  using  task 
analysis.  One  methodology  is  as  follows.  The 
task  analyst  first  finds  features  that  the  domain 
affords  (offers)  for  problem  solving  methods 
to  manipulate.  Domains  differ  in  what 
features  they  afford.  For  example,  simple 
symbolic-logic  domains  afford  the  detection 
of  differences  between  current  and  desired 
states,  making  means-ends  analysis  (MEA) 
feasible;  in  chess,  on  the  other  hand,  the 
desired  state  is  too  vague  to  afford  difference 
detection,  so  MEA  is  not  feasible  (Newell  and 
Simon,  1972). 

With  affordances  in  hand,  the  analyst  then 
finds  methods  that  use  them  (such  as  MEA). 
Methods,  applied  in  a  particular  domain,  are
made  up  of  components  of  performance
knowledge.  These  constituents  are  learning
targets.  The  collection  of  such  targets,  from
all  the  methods  that  apply  to  the  domain,  forms
the  domain-specific  taxonomy.

Learning diversity relative to this taxonomy
depends on performance diversity; the more varied
the performance components, the more varied the set
of learning targets. This may seem to just move the
difficulty from the measurement of learning
diversity to the measurement of performance
diversity. But task analysis, while it cannot
directly specify how a system will learn when
operating in a particular task environment, can
specify how a system must perform in that
environment (Newell and Simon, 1972; Wilde and
Lewis, 1991). The notion that task analysis can
define requirements for system performance
underlies Anderson’s rational-analysis theory of
cognition (Anderson, 1990). Similarly, diversity in
performance as a response to a complex environment
is a subtext of Minsky’s Society of Mind (Minsky,
1986). Thus task analysis has a history of being
used to analyze performance. With respect to
learning, analyzing performance is an indirect but
operational way to measure learning diversity.

Moreover, task analysis is itself amenable to
verification against an external standard. Human
performance can be monitored experimentally through
protocols (Ericsson and Simon, 1984), and analyzed
to model the underlying computational behavior
(Newell and Simon, 1972). Performance
specifications derived from task analysis can
therefore be calibrated against models of human
performance.

This calibration is relative to the human scale.
While this scale is not the limit of performance
and learning, it provides an external standard of
proficiency that computer systems have not yet
reached.

Figure 1: Simple task from the switchyard domain
capacity:  1  unit

4.  A  Sample  Domain-Specific  Taxonomy

This  section  gives  an  example  of  a  domain-
specific  taxonomy  by  sketching  several
problem-solving  methods  for  a  domain,  in 
enough  detail  to  show  high-level  components 
and  how  the  method  is  useful.  These  methods 
reveal  a  set  of  learning  targets  for  the  domain. 
In  total  there  are  five  methods  and  ten  targets. 

Tasks  in  this  domain  manipulate  trains  in  a 
railway  switchyard.  The  sample  task  analysis 
shows  how  affordances  in  the  domain  (such  as 
difference  detection)  are  manipulated  by 
methods,  and  how  even  a  simple,  knowledge- 
lean  domain  can  yield  a  variety  of  targets. 

Figure  1  shows  a  simple  task  in  which  the  two 
unshaded  cars  swap  positions.  The  solution 
path  is  included;  a  step  consists  of  a  train 
moving  from  one  position  (left,  right,  or
siding)  to  another. 

Figure  2  shows  a  puzzle  task  (Delft  and 
Botermans,  1978),  in  which  two  trains  need  to 
pass  each  other  when  the  siding  is  too  small 
for  a  full  train.  The  optimal  solution  has 
sixteen  moves.

Figure  2:  Puzzle  task  from  the  switchyard  domain


Figure  4:  M1  is  necessary


1. Heuristic search (1 target). Heuristics are
knowledge about what steps are or are not usually
useful. One powerful heuristic for the puzzle task
prefers moving trains with one engine to moving
trains with two. The pool of heuristic knowledge is
a learning target. A learning component for this
target could compile heuristics from past
experience, or specialize them from
domain-independent heuristics in the manner of
Eurisko (Lenat, 1983).

2. Case-based search control (3 targets).
Case-based reasoning (CBR) makes use of a memory of
past problem solving episodes (cases). One kind of
useful case comprises a sequence of moves, the task
to which the sequence was applied, and the result,
whether positive or negative (Veloso and Carbonell,
1991). Figure 3 shows two successive moves from the
initial state of the puzzle task, applying the
heuristic from the previous paragraph to choose the
second move (M2). The heuristic in this case leads
to a state loop. To prevent this loop from
recurring, either immediately or later in this or
another task, M1 and its outcome under this
heuristic can be learned, for recall under similar
circumstances.


Case memory itself is a target, because the more
cases learned the better (if they are stored and
retrieved properly). When a case is stored, a
component must decide how the case will be indexed.
This component is also a target, because good
indexing depends on knowledge about what features
of a case are relevant, and this knowledge can be
learned. And at retrieval time, another component
must search for relevant cases. But cases that seem
relevant may be misleading, because indexing cannot
be perfect. For example, in Figure 4 the move M1,
which was a dead end in Figure 3, is necessary to
allow the engine on the siding to move past the
left-hand train. Thus the retrieval component is
also a target because, as it learns more about the
domain, it can retrieve cases more accurately.

3.  Means-ends  analysis  (3  targets).  Means- 
ends  analysis  (MEA)  requires  that  the  current 
state  be  compared  with  a  desired  state  and 
differences  detected.  In  Figure  5,  the
unshaded  engine  begins  to  the  right  of  the
flatcar  but  ends  up  not  to  the  right.  Reducing
this  difference  entails  getting  the  two  units 
past  each  other,  but  the  steps  needed  lead 
directly  to  the  desired  state.  MEA  is 
particularly  powerful  in  this  case. 


Figure 5: Applying means-ends analysis (initial state: engine to the
right of flatcar; desired state: engine NOT to the right of flatcar)

Figure 6: A correct but inefficient solution


Components  for  MEA  include  difference 
detection,  difference  ranking,  and  the 
connections  between  operators  and  the 
differences  they  reduce.  All  three  are  targets 
(see  also  Rich  1983,  p  367,  for  a  discussion  of
learning  in  MEA). 

4. Solution refinement (1 target). The switchyard
domain admits solutions that are correct but
inefficient. In Figure 6 the task is solved in
three steps, but can be solved in one by moving the
unshaded train down from the siding directly.
Humans are adept at recognizing and avoiding simple
inefficient sequences like this one. In general
this requires combinatoric search, making solution
refinement a good target: the problem solver can
get better at avoiding inefficient sequences.

5.  Strategy  selection  (2  targets).  Given 
multiple  methods,  the  problem  solver  must  be
able  to  make  tactical  decisions  about  when  to
choose  a  particular  method  and  when  to
compose  methods  (an  example  of  composition
would  be  the  application  of  MEA  during
solution  refinement).  It  must  also  evaluate
whether  a  method  is  making  progress  and
abandon  it  if  not.  The  problem  solver  can  get
better  at  selecting  methods  and  smarter  about 
abandoning  them. 
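
The five methods and their targets can be collected into a small data structure, which makes the "five methods and ten targets" count checkable and gives a concrete taxonomy against which a system's learning can be compared. The Python below is only a sketch: the short target labels are paraphrases of the descriptions above, and the names `TAXONOMY`, `all_targets`, `coverage` and `example_system` are hypothetical.

```python
# Methods and learning targets named in the switchyard analysis above
# (target labels are paraphrased for brevity).
TAXONOMY = {
    "heuristic search": ["heuristic pool"],
    "case-based search control": ["case memory", "case indexing", "case retrieval"],
    "means-ends analysis": ["difference detection", "difference ranking",
                            "operator-difference connections"],
    "solution refinement": ["inefficient-sequence avoidance"],
    "strategy selection": ["method selection", "method abandonment"],
}

def all_targets(taxonomy):
    """Flatten the per-method target lists into one collection of targets."""
    return [t for targets in taxonomy.values() for t in targets]

def coverage(system_targets, taxonomy):
    """Fraction of taxonomy targets that a given system actually learns."""
    targets = all_targets(taxonomy)
    return sum(t in system_targets for t in targets) / len(targets)

# A hypothetical system that only learns heuristics and cases.
example_system = {"heuristic pool", "case memory"}
```

With this encoding, `coverage(example_system, TAXONOMY)` gives one possible reading of how diversely the hypothetical system learns relative to the switchyard taxonomy.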

5.  The  Uses  of  Task  Analysis 

Task analysis as demonstrated in the previous
section serves several purposes, summarized here.
First, it provides an external standard against
which to measure learning diversity; a system
learns diversely to the extent that its kinds of
learning cover the targets identified through task
analysis. This external measure of scope
complements density, which provides a measure
internal to the system.

Second,  it  provides  a  specification  of
performance  methods  for  that  domain.  The
more  detailed  the  task  analysis,  the  more
detailed  the  specification.

Third,  and  related  to  the  previous  point,  it 
provides  a  specification  for  how  to  arrange 
communication  and  control  between  methods.
In  the  switchyard  example,  strategy  selection
(method  5)  received  only  perfunctory
treatment,  but  in  principle  task  analysis  could
yield  considerable  guidance  for  how  to 
organize  the  other  methods  to  work  together 
productively.  This  makes  task  analysis  an 
important  source  of  leverage  on  the  hard 
problems  of  integrating  multiple  methods. 

Fourth,  task  analysis  can  be  verified  for 
completeness  relative  to  human  performance. 
This  verification  serves  to  calibrate  a  domain- 
specific  taxonomy  to  the  human  scale. 

6.  Sharable  Learning  Components

In  any  kind  of  system-building,  cost  can  be 
reduced  by  sharing  (reusing)  modules.  For
learning  systems,  cost  can  be  reduced  by 
sharing  learning  components.  Sharable 
learning  components  today  are  typically  found 
embedded  in  general  problem-solving 
architectures.  Examples  include  chunking  in 
Soar  (Laird  et  al,  1986),  the  Labyrinth 
conceptual-clustering  module  in  Icarus
(Langley  et  al,  1991),  and  the  various  learning 
modules  in  Prodigy  (Carbonell  et  al,  1991). 

Two  factors  affect  the  promise  of  a  particular 
architecture  to  support  broad  learning  scope. 
First is the extent to which the architecture
reduces the overhead of expanding the learning
scope of a system. Ideally, adding a new
performance method would be routine. So would
connecting it to a sharable learning component to
create one or more learning targets. In fact,
however, routine addition of new methods and
routine integration with learning components have
not been achieved.
For  example,  correct  integration  of  chunking 
into  Soar  performance  systems  is  highly 
sensitive  to  the  representation  of  performance 
knowledge  (Laird  et  al,  1986).  But  this 
sensitivity  may  be  endemic  to  learning 
systems  (Flann  and  Dietterich,  1989);  if  so, 
the  relative  promise  of  sharable  components 
remains. 

The  second  factor  affecting  an  architecture’s 
potential  for  supporting  scope  is  the  extent  to 
which  its  sharable  learning  components  are 
general.  The  more  a  component  can  learn 
about,  the  more  targets  are  possible. 
Generality  depends  on  how  the  component  is 
integrated  with  other  components  in  the 
architecture. 

Chunking,  for  example,  is  integrated  with 
Soar’s  control  strategy,  universal  subgoaling 
(Laird,  1984).  In  universal  subgoaling,  a  gap 
or  complexity  in  performance  knowledge 
prompts  the  architecture  to  set  a  subgoal. 
When  problem-solving  achieves  the  subgoal, 
chunking  captures  the  element  of  knowledge, 
whether  compiled  from  knowledge  elsewhere 
in  the  system  or  brand  new  (Newell,  1990). 
Thus,  although  chunking  has  only  one 
architectural  target  (production  memory),  it 
has  in  principle  an  unbounded  number  of 
performance-knowledge targets. So far, it has
been applied (in different systems) to
instruction taking (Huffman, 1991), induction
and knowledge-level learning (Rosenbloom
and Aasman, 1990; Miller and Laird, 1991),
and integration of multiple sources of
knowledge in natural language understanding
(Lehman et al, 1991), as well as speedup
learning (Laird et al, 1986; Tambe, 1991). (A
summary of references for varieties of
learning in Soar appears in the introduction to
Rosenbloom et al, 1993.)

The  accumulated  evidence  for  chunking 
suggests  that  it  is  general  enough  to  support 
unbounded  increases  in  scope,  but  that 
possibility has not been proven. The promise
of  mechanisms  like  chunking  has  yet  to  be 
tested  in  an  effort  aimed  directly  at  achieving 
great  scope  in  a  single  system. 

7. Implications for Multistrategy Learning

Two  observations  and  an  ensuing  hypothesis 
can  be  drawn  from  this  discussion  of  scope 
measures  and  sharable  learning  components. 

The first observation is that solutions to
problems  like  integration  and  control  of 
multiple  learning  components  depend  as  much 
on  the  domain  as  they  do  on  generic 
components.  This  is  borne  out  in  the  task 
analysis,  in  the  need  for  a  strategy  selection 
method  (page  5).  Here  the  perfunctory 
treatment  of  this  method  hides  many  of  the 
problems  of  getting  the  other  methods  to 
perform  and  learn  together,  but  more  detailed 
task  analysis  would  help  define  these 
problems  and  guide  solutions.  General 
solutions to these hard problems may emerge
only  through  pursuit  of  broad  learning  scope 
on  a  variety  of  domains. 

The  second  observation  is  that  sharable 
learning  components  are  potentially  better  at 
enabling  great  scope  than  special-purpose, 
hardwired  implementations  of  learning 
algorithms.  Relative  to  a  domain-specific 
measure  of  learning  scope,  the  variety  of 
generic  learning  algorithms  in  a  system  is  a 
secondary  issue;  as  few  as  one  might  do  to 
achieve  great  scope  relative  to  the  domain  at 
hand. 

These two observations lead to the following
hypothesis: A direct and comparatively low-overhead
way to face hard problems in
multistrategy learning, such as control and
integration of learning components, is to
pursue broad learning scope for specific
domains within architectures, guided by
comprehensive task analysis.

8.  Summary 

Broad  learning  scope  involves  modifying  a 
variety  of  bodies  of  performance  knowledge. 
Two  qualities  of  scope  are  density  and 
diversity.  Density  measures  the  degree  to 
which  learning  pervades  a  system.  Diversity 
measures  the  degree  to  which  learning  covers 
the  kinds  of  learning  afforded  by  the  domain. 

A  domain-specific  taxonomy  for  measuring 
learning  diversity  is  a  product  of  task  analysis, 
which  also  helps  specify  performance 
methods,  including  those  responsible  for 
control  flow  and  communication.  Task 
analysis  can  be  checked  for  completeness 
against  human  behavior,  and  serves  to 
calibrate  a  domain-specific  taxonomy  of 
learning  forms  to  human-scale  performance. 

Building  systems  with  broad  scope  is 
expensive  because  of  the  high  marginal  cost 
of  building  new  system  components  to  modify 
new  targets.  Sharable  learning  components, 
as  embedded  in  learning  architectures,  hold 
the  promise  of  reducing  this  overhead. 

Sharable  learning  components,  providing  cost 
advantage,  and  task  analysis,  providing 
method  specifications,  guidance  for 
integration  and  control,  and  means  of 
measurement,  both  need  to  be  exploited  in  the 
building of systems with broad learning scope.

Abstract

This paper introduces the concept of multisource
learning. It demonstrates how a multisource
learner can be used as a framework
for multistrategy learning. A variation of the
PAC model of learning correctness called
PAC-Error identification is defined and
shown to be suitable for multisource learning.
An algorithm for PAC-Error identification
is developed for a knowledge base of
definite clauses. It is shown that PAC-Error
identification is effective in integrating definite
clauses that have been contributed from
several sources. This framework can be used
as the basis of a multistrategy learner which
can produce hypotheses with small error.

Keywords:  Multistrategy  Learning 
Multisource  Learning 
PAC  Learning 

1  Introduction 

A multistrategy learner combines learning
strategies to produce effective hypotheses. If
each of the learning strategies was implemented
as an independent entity that contributed
knowledge to a knowledge base then an
algorithm would be required to integrate this
knowledge to produce an effective hypothesis.
A multisource learner implements such
an algorithm. Effectiveness is measured as
the ability of the resultant hypothesis to
satisfactorily complete performance tasks.

The original idea behind multisource learning
was to permit knowledge sources, which
were themselves independent learning elements,
to contribute their knowledge to a
knowledge base. It was envisaged that the
knowledge base would contain multiple target
concepts and the learning elements could
be used to build different parts of the knowledge
base. Each learning element could also
implement a different strategy. For example,
an explanation-based learner could be used to
improve efficiency, an abductive learning
element would contribute new, plausible
clauses and an inductive learning element
could construct new clauses from examples.

For our purposes we assume that the knowledge
base is a set of definite clauses.

A multisource learner must select those
clauses from the knowledge base which
result in satisfactory performance across the
population of tasks. Collectively the selected
clauses are a hypothesis.

A major issue in extracting a hypothesis is
determining a model of correctness. A logic
programming definition could be that all logical
consequences of a hypothesis h are in the
intended interpretation and all elements of
the intended interpretation are logical consequences
of h. There are two problems with
this definition:

1. The hypothesis can often be more general
than what is required. It requires
only those clauses relevant to the set of
tasks to be performed.

2. Exact identification of the hypothesis
can be too strong as the criterion for
correctness. A weaker variant may be
more practical.

This paper outlines a model for multisource
learning and presents an algorithm for what
can be termed PAC-Error identification. This
algorithm is based on a PAC correctness
model.

2  A  Model  of  Learning 

A multisource learner is shown in Figure 1.


Figure  1 

Its knowledge base contains a set of definite
clauses. Each knowledge source independently
contributes definite clauses to the
knowledge base. The multisource learner
accepts tasks in the form of classified goal
clauses from the teacher. The aim of the multisource
learner is to extract a set of definite
clauses from the knowledge base that satisfactorily
(within the constraints of the PAC
model) solve the tasks given by the teacher.

In the case of multistrategy learning the
knowledge sources implement various learning
strategies. The details of these knowledge
sources (in this case learning elements)
are beyond the scope of this paper other than
to show the contributions that each learning
element makes to the knowledge base.

The multisource model is a supervised learning
model. Typically, in supervised models,
examples are vectors of attribute-value pairs
and the classification indicates whether the
example is a member of the target concept or
not. The examples given by the teacher in
multisource learning are goal clauses where
a simple boolean flag is not sufficient for
classification. Instead the teacher must provide
the set of goal clause instances which
are members of the target concept.

Example 1: Suppose that a knowledge base is to contain
facts about relationships between family members.
A goal clause could be ←father(fred, C). The
instances which are members of the target concept
could be {father(fred, ian), father(fred, jane)}.

3  Correctness  of  Learning 

Traditionally, the ideal for the correctness of
a hypothesis has been that it determine all
positive members of a target concept to be
positive and negative members to be negative.
The sample complexity for this task can
be large. No matter how many examples are
used to derive a hypothesis, an example can
be presented which invalidates the hypothesis.

Probably approximately correct (PAC) learning
(Valiant, 1984) permits a probabilistic
characterization of a hypothesis. After n
examples the probability of being presented
with an example that is inconsistent with the
derived hypothesis is less than an error
threshold ε with a probability 1 - δ. The literature
contains many results for learning
classes of hypotheses (Kearns, 1989).

In multisource learning, as described in this
paper, the hypothesis class H is the power set
of the set of definite clauses in the knowledge
base. A PAC learning algorithm would
require that a hypothesis be identified from
this set that has error less than ε with high
confidence. Clearly, for multisource learning,
it is not always possible to find such a
hypothesis in H because H is dependent on
the clauses available in the knowledge base.
For multisource learning the hypothesis
should have the minimal error in the set of
hypotheses. The multisource learner
described in this paper performs what can be
termed PAC-Error identification, where it
returns a hypothesis h that with confidence
1 - δ has an error probability in the interval
[ε - d, ε + d]. The constant d determines the
closeness with which the error is to be
approximated. Ideally the hypothesis h has
the smallest ε in H.

4  Probability  Regions 

A goal clause, g, permits several instances¹.
Each goal clause has a set of instances which
are members of the target concept. Denote
this set by T_g. A hypothesis, h, determines
the set of goal instances, h_g, that logically follow
from it.

Each goal instance has a relative probability
defined to be the probability of its parent
goal divided by the number of instances possible
for the parent goal. The total probability
of a goal instance is the sum of relative probabilities
of all its parent goal clauses.

Figure 2 shows the universe of instances of
those goal clauses available to the teacher
and the relationship between T_g and h_g.

Let T_U be the set of all goal instances from U
that are members of the target concept and h_U
be the set of all instances from U that logically
follow from h.

1. A ground instance is derived by substituting
constants for all variables in the goal clause. For
this paper assume all instances are ground.


Four  probability  regions  are  defined: 

False positive region (FP): Denotes the set
of instances that are in h_U but not in T_U.
Let FP-probability be the sum of the
probabilities of the instances in this
region.

True negative region (TN): Denotes the set
of instances that are not in h_U and not in
T_U. Let TN-probability be the sum of the
probabilities in this region.

True positive region (TP): Denotes the set
of instances that are in h_U and T_U. Let
TP-probability be the sum of the probabilities
in this region.

False negative region (FN): Denotes the set
of instances that are not in h_U but are in
T_U. Let FN-probability be the sum of
probabilities in this region.

The error probability of a hypothesis h is the
sum of the FP-probability and the FN-probability.
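The four regions and the error probability can be illustrated with a short sketch. It is not taken from the paper: the function names, the dictionary representation of the universe U, and the toy instances a to d with their probabilities are all assumptions made for illustration.

```python
from fractions import Fraction

def region_probabilities(prob, h_u, t_u):
    """Sum instance probabilities into the TP, FP, FN and TN regions.

    prob maps every ground goal instance in the universe U to its total
    probability; h_u is the set entailed by the hypothesis and t_u is the
    set belonging to the target concept (hypothetical representation).
    """
    regions = {"TP": Fraction(0), "FP": Fraction(0),
               "FN": Fraction(0), "TN": Fraction(0)}
    for instance, p in prob.items():
        if instance in h_u:
            key = "TP" if instance in t_u else "FP"
        else:
            key = "FN" if instance in t_u else "TN"
        regions[key] += p
    return regions

def error_probability(regions):
    """Error probability is FP-probability plus FN-probability."""
    return regions["FP"] + regions["FN"]

# Toy universe (hypothetical values, not from the paper).
toy_prob = {"a": Fraction(1, 2), "b": Fraction(1, 4),
            "c": Fraction(1, 8), "d": Fraction(1, 8)}
toy_regions = region_probabilities(toy_prob, h_u={"a", "b"}, t_u={"a", "c"})
toy_error = error_probability(toy_regions)
```

In the toy universe, instance b is entailed but not in the target (FP) and instance c is in the target but not entailed (FN), so the error is the sum of their probabilities.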

Figure  3  shows  the  probability  regions  across 
the  universe  of  goal  instances. 


Figure  3 

There are two observations about these probability
regions that aid in the development of
a learning algorithm:

1. The deletion of a clause can result in the
transfer of TP-probability to FN-probability
and FP-probability to TN-probability.

2. The addition of a clause can result in the
transfer of FN-probability to TP-probability
and TN-probability to FP-probability.

5  Determining  Confidence  in  the 
Error 

Consider a population of examples which are
classified as either positive or negative with
respect to some target concept. What is the
probability that a randomly chosen example
will be incorrectly classified by a hypothesis
h? This is the error rate for the hypothesis h.

The error rate for a hypothesis h can be
approximated by sampling. Suppose that n
examples are chosen at random from the
population. Let X_i be 0 if example i is correctly
classified and 1 otherwise. Each X_i is a
random variable and <X_1, X_2, ..., X_n> is a
sample of the population. The sample mean
is x̄. The X_i are independent and identically
distributed.

The  expected  probability  of  error  is  the  sum 
of  the  probabilities  of  incorrectly  classified 
examples  across  the  population.  This  mean 
can  be  estimated  and  a  confidence  limit  for  it 
can  be  determined  through  sampling. 

Assume that n > 30 and the estimate of
standard deviation s has inconsequential
error with respect to σ. The distribution of
the sample mean x̄ is approximately normal.
Two theorems are useful in this case (Walpole
et al., 1978).

Theorem 1: If x̄ is used as an estimate of μ,
we can be (1 - δ)100% confident that the
error will be less than

$$ z_{\delta/2} \, \frac{\sigma}{\sqrt{n}} $$

The value z_{δ/2} is the value of the standardized
normal variable above which there
is an area of δ/2.

Theorem 2: If x̄ is used as an estimate of μ,
we can be (1 - δ)100% confident that the
error will be less than a specified amount
d when the sample size is:

$$ n = \left( \frac{z_{\delta/2} \, \sigma}{d} \right)^{2} $$


Example 2: Suppose that a sample of 40 examples was
taken from the population of examples. Ten of these
examples are incorrect. The sample standard deviation
s is 0.1923. Assuming that s accurately estimates
σ then the number of examples required such that the
difference between the sample mean and the true
mean is less than 0.01 with confidence 0.95 is:

$$ n = \left( \frac{z_{\delta/2} \times 0.1923}{0.01} \right)^{2} = 1421 $$

The sample mean is the error rate.
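The two theorems translate directly into code. The sketch below is an illustration rather than part of the paper; the function names are assumptions, and the normal quantile is taken from Python's standard `statistics.NormalDist`. It takes the stated value s = 0.1923 at face value. For ten errors among 40 binary outcomes, 0.1923 coincides with the sample variance (40 × 0.25 × 0.75 / 39), so a reader recomputing s from the raw counts may obtain a different figure.

```python
import math
from statistics import NormalDist

def z_value(delta):
    """Standard normal value with an upper-tail area of delta / 2."""
    return NormalDist().inv_cdf(1 - delta / 2)

def error_bound(delta, sigma, n):
    """Theorem 1: bound on |sample mean - true mean| at confidence 1 - delta."""
    return z_value(delta) * sigma / math.sqrt(n)

def sample_size(delta, sigma, d):
    """Theorem 2: smallest n giving an error below d at confidence 1 - delta."""
    return math.ceil((z_value(delta) * sigma / d) ** 2)

# Values stated in Example 2: s = 0.1923, d = 0.01, confidence 0.95.
n_example_2 = sample_size(delta=0.05, sigma=0.1923, d=0.01)
```

Rounding up is a design choice of the sketch: it guarantees that the Theorem 1 bound for the returned n does not exceed d.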


These confidence calculations apply equally
to the case where each X_i has an error value
in the range [0, 1]. This is the case for goal
clauses. Each goal instance of a goal clause g
has a value which is 1/(# instances of g). This
value assumes that each instance of a goal
clause is equally important. The error value
for the goal clause is the sum of values of
those goal instances in h_g Δ T_g (the symmetric
difference). The expected value for this
error value gives the error rate across goal
clauses. The error rate can be interpreted as
the probability that an arbitrary goal clause g
will have instances in T_g not in h_g or
instances in h_g not in T_g.
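As a minimal sketch of the error value of a single goal clause, the symmetric difference can be computed with Python sets. The function name and argument layout are assumptions; the numbers follow the goal ←father(fred, X) discussed in Section 7.1, where the initial domain theory entails no instance, two instances belong to the target concept, and seven instances are possible.

```python
from fractions import Fraction

def goal_error_value(h_g, t_g, n_instances):
    """Error value of one goal clause.

    Every instance is worth 1 / n_instances; the error value sums the
    worth of the instances in the symmetric difference of h_g and t_g.
    """
    return Fraction(len(set(h_g) ^ set(t_g)), n_instances)

# <-father(fred, X): nothing entailed, {ann, william} expected, 7 constants.
father_fred_error = goal_error_value(set(), {"ann", "william"}, 7)
```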

6  PAC-Error  Identification 


The algorithm presented here performs PAC-Error
identification of a hypothesis h ⊆ C
where C is the set of clauses available in the
knowledge base.


Roughly, the algorithm calculates the TP, FP,
FN and TN probabilities with h = C. It maintains
sufficient information to determine all
probabilities of any h' ∈ H. A hill-climbing
search is applied to obtain a local optimum
of ε and the final hypothesis h_k is returned as
the result. The required information includes
the TP and FP probabilities of each clause
and the set of refutation sets (a refutation set
is the set of clauses involved in a refutation
of a goal instance). Each iteration of the
algorithm deletes a clause until no further
deletions improve the quality of the hypothesis.
A sequence of hypotheses h_1, h_2, ..., h_k is derived.

Each element h_i in the sequence has less
error and one less clause than its predecessor.
Each element h_i has values for TP, FP, FN
and TN. The resultant hypothesis h_k is
returned with its error probability ε.

The algorithm has two major components:
determining region probabilities with the
desired confidence (ConfThresh) and identifying
the clause set that has small error
(PAC-Error-Ident).

ConfThresh(δ, d)

1. Randomly select m > 30 goal clauses, calculate
the error of each and find the sample standard
deviation s. Calculate the number of samples n
required to obtain the necessary confidence 1 - δ
for the interval size 2d.

2. Initialise the region values² TP, FN, FP and TN
to 0.

3. Let h be the set of clauses in the knowledge base.

4. Perform n times:

(a) Accept a classified goal clause from the
teacher. T_g is given.

(b) Derive the set of goal instances h_g implied
by h. Retain the refutation set R_gi of every goal
instance g_i.

(c) Calculate the value of each goal instance g_i.
Let this value be denoted by v_gi.

(d) For each goal instance g_i ∈ h_g:

• If g_i ∈ T_g increment TP by v_gi. Also increment
the TP of each clause in R_gi and the TP
of R_gi by v_gi.

• If g_i ∉ T_g increment FP by v_gi. Also increment
the FP of each clause in R_gi and the FP
of R_gi by v_gi.

(e) For each t_i ∈ T_g - h_g increment FN by v_ti.

5. Divide TP, FP, FN and TN by the sample count n.

6. Perform PAC-Error-Ident(TP, FP, FN, TN, h).

2. Region values only become probabilities after
division by the sample count, n.

It must be noted that when maintaining probabilities
for clauses and refutation sets only
the probabilities TP and FP have any meaning.

PAC-Error-Ident(TP, FP, FN, TN, h)

1. Let i = 1.

2. Let h_i = h.

3. Let TP_k be the TP value of clause C_k.

4. Let FP_k be the FP value of clause C_k.

5. Let Error_k be FP_k - TP_k for clause C_k.

6. Let C_j ∈ h_i be the clause with the highest error
value Error_j for all j.

7. Perform until no clause C_j ∈ h_i has
Error_j > 0:

(a) For each clause C_r ∈ h_i where r ≠ j and C_r
participates in a refutation with C_j:

• Subtract the TP-probability of every common
refutation from the TP-probability of
clause C_r.

• Subtract the FP-probability of every common
refutation from the FP-probability of
clause C_r.

(b) Let TP = TP - TP_j.

(c) Let FP = FP - FP_j.

(d) Let FN = FN + TP_j.

(e) Let TN = TN + FP_j.

(f) Let i = i + 1.

(g) Let h_i = h_(i-1) - {C_j}.

(h) Let C_j ∈ h_i be the clause with the highest
error value Error_j for all j.

8. Return h_i and ε = FP + FN.
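The greedy deletion loop above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation. It assumes each goal instance has exactly one refutation set, recomputes the per-clause error from the surviving refutations instead of updating it incrementally, and breaks ties by clause name. The data are the refutation sets of Table 3 in Section 7.3, written as exact fractions (0.2/7 = 1/35 and 0.3/7 = 3/70) where the paper prints 0.0285 and 0.0428.

```python
from fractions import Fraction

def pac_error_ident(refutations, fn):
    """Greedy clause deletion (hypothetical sketch of PAC-Error-Ident).

    refutations: list of (clause_names, tp_value, fp_value), one entry
    per goal instance entailed by the full knowledge base.
    fn: FN-probability of the initial hypothesis.
    Returns (deleted clause names, final error probability FP + FN).
    """
    live = list(refutations)
    deleted = []
    while True:
        # Error_k = FP_k - TP_k, summed over surviving refutations using C_k.
        error = {}
        for clauses, tp, fp in live:
            for c in clauses:
                error[c] = error.get(c, 0) + fp - tp
        worst = max(sorted(error), key=error.get, default=None)
        if worst is None or error[worst] <= 0:
            break
        # Deleting C_j moves its TP to FN; its FP leaves the FP region.
        fn += sum(tp for clauses, tp, _ in live if worst in clauses)
        live = [r for r in live if worst not in r[0]]
        deleted.append(worst)
    fp_total = sum(fp for _, _, fp in live)
    return deleted, fp_total + fn

A, B = Fraction(1, 35), Fraction(3, 70)
TABLE_3 = [
    ({"C1", "C9", "C12"}, A, 0),        # father(fred, ann)
    ({"C1", "C10", "C12"}, A, 0),       # father(fred, william)
    ({"C2", "C6", "C13"}, 0, B),        # father(patricia, jan)
    ({"C4", "C8"}, B, 0),               # mother(patricia, fred)
    ({"C4", "C11"}, 0, B),              # mother(graeme, fred)
    ({"C3", "C6", "C7", "C13"}, A, 0),  # sister(jan, grace)
    ({"C3", "C6", "C8", "C13"}, A, 0),  # sister(jan, fred)
    ({"C3", "C6", "C13"}, 0, A),        # sister(jan, jan)
]
deleted_clauses, final_error = pac_error_ident(TABLE_3, fn=B)
```

With this tie-breaking rule the sketch removes C11 before C2, whereas the walkthrough in Section 7.3 removes C2 first; the resulting hypothesis and its error of 1/14 (about 0.0714) are the same either way.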

7  Experimental  Work 

7.1  A  domain  theory 

Consider the family tree given in Figure 4.
Assume that the multistrategy learner must
learn the relationships that hold in this tree. It
is assumed that a language is pre-defined for
definite clauses and goal clauses.

[Figure 4: family tree. Patricia and Graeme are the parents of Grace, Jan and Fred; Fred is the parent of Ann and William.]

The knowledge base initially contains the
following clauses:

sister(X,Y) ← parent(Z,X), parent(Z,Y), female(X)
sister(jan, fred) ←
parent(patricia, grace) ←
parent(fred, ann) ←
parent(fred, william) ←
parent(graeme, fred) ←
male(fred) ←

For illustration purposes assume that the
population of goal clauses, their classifications
and probability distribution are known
and given in Table 1.


Goal Clause (g)      T_g                                         Prob.
←father(fred, X)     father(fred, ann), father(fred, william)    0.2
←father(X, jan)      father(graeme, jan)                         0.3
←mother(X, fred)     mother(patricia, fred)                      0.3
←sister(jan, X)      sister(jan, grace), sister(jan, fred)       0.2

Table 1: Goal Clause Probabilities

Here we use the probability distribution to
analytically determine the sample size. Normally
the probability distribution would not
be known a priori. The ConfThresh algorithm
would be needed to determine the sample
size.

The  region  probabilities  are  the  expected  val¬ 
ues  of  each  region  value  across  goal  clauses. 

Example 3: Consider the goal clause ←father(fred,
X). There are seven possible solutions to this goal
clause (all constants in the language, i.e. all family
tree members). The value of any one solution is 1/7.
The original domain theory does not entail any
instances of this goal clause but there should be two
(ann and william). The FN-value is 2/7, the FP-value
is 0, the TP-value is 0 and the TN-value is 5/7.

7.2 Knowledge sources/learning elements

The multisource learner becomes a multistrategy
learner when it permits knowledge
sources to become independent learning elements.
The overall strategy of each learning
element is described and its contributions to
the knowledge base are given.

7.2.1  Learning  by  abduction 

A  learning  element  performs  abduction  by 
selecting  a  clause  from  the  knowledge  base 
whose  head  is  proven  but  whose  body  is  yet 
to  be  proven  and  concludes  the  instance  of 
the  body.  The  instantiated  elements  of  the 
body  are  added  to  the  knowledge  base. 

Using the initial domain theory, in particular
the clauses:

sister(X,Y) ← parent(Z,X), parent(Z,Y), female(X)
sister(jan, fred) ←

the unit clauses parent(patricia, jan),
parent(patricia, fred) and female(jan) could
be added to the knowledge base.

7.2.2  Learning  by  induction 

An inductive learner could consider the following
examples for father(X, Y):

• parent(graeme, grace), male(graeme), female(grace) (+)
• parent(graeme, jan), male(graeme), female(jan) (+)
• parent(fred, ann), male(fred), female(ann) (+)
• parent(patricia, fred), female(patricia), male(fred) (-)

Two plausible hypotheses that an inductive
learning element might produce to explain
these facts could be:

father(X,Y) ← male(X), parent(X,Y)
father(X,Y) ← female(Y), parent(X,Y)
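A small check shows why both hypotheses are plausible: each one covers the three positive examples and rejects the negative one. The encoding below is an assumption made for illustration. Since parent(X, Y) holds in all four examples, each clause body reduces to a test on the sex of X or of Y.

```python
# Facts taken from the four examples for father(X, Y).
MALE = {"graeme", "fred"}
FEMALE = {"grace", "jan", "ann", "patricia"}

# (X, Y, label): parent(X, Y) holds in every example.
EXAMPLES = [
    ("graeme", "grace", True),
    ("graeme", "jan", True),
    ("fred", "ann", True),
    ("patricia", "fred", False),
]

def h_male_parent(x, y):
    """father(X,Y) <- male(X), parent(X,Y)."""
    return x in MALE

def h_female_child(x, y):
    """father(X,Y) <- female(Y), parent(X,Y)."""
    return y in FEMALE

def consistent(hypothesis):
    """True if the hypothesis classifies every example as labelled."""
    return all(hypothesis(x, y) == label for x, y, label in EXAMPLES)

both_consistent = consistent(h_male_parent) and consistent(h_female_child)
```

The second hypothesis fits the sample only because every positive example happens to involve a daughter, which is the kind of over-fitted clause that the later deletion step is meant to remove.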

7.2.3  Learning  by  discovery 

A learning element could be defined to produce
arbitrary clauses. This learning element
might be constructed analogously to a classifier
system (Wilson, 1987) where strong
clauses are copied (reproduction), mixed (crossover)
and modified (mutation).

Assume that a learning element performs
discovery and produces the clause:

mother(X,Y) ← parent(X,Y)

7.3  PAC-Error  identification 

Suppose the knowledge base has gained
clauses from each of the learning elements
described previously. It contains the following
clauses:

(C1) father(X,Y) ← male(X), parent(X,Y)
(C2) father(X,Y) ← female(Y), parent(X,Y)
(C3) sister(X,Y) ← parent(Z,X), parent(Z,Y), female(X)
(C4) mother(X,Y) ← parent(X,Y)
(C5) sister(jan, fred) ←
(C6) parent(patricia, jan) ←
(C7) parent(patricia, grace) ←
(C8) parent(patricia, fred) ←
(C9) parent(fred, ann) ←
(C10) parent(fred, william) ←
(C11) parent(graeme, fred) ←
(C12) male(fred) ←
(C13) female(jan) ←

This knowledge base h entails the solutions
given in Table 2 with respect to the goal
clauses:

Goal Clause (g)      h_g
←father(fred, X)     father(fred, ann), father(fred, william)
←father(X, jan)      father(patricia, jan)
←mother(X, fred)     mother(patricia, fred), mother(graeme, fred)
←sister(jan, X)      sister(jan, grace), sister(jan, fred), sister(jan, jan)

Table 2


The true standard deviation is:

$$ \sigma = \sqrt{E(X^{2}) - \mu^{2}} $$

The mean error probability is the expected
value of the error value which is:

$$ \mu = E(\mathrm{error}) = \tfrac{2}{7} \times 0.3 + \tfrac{1}{7} \times 0.3 + \tfrac{1}{7} \times 0.2 = 0.1571 $$

The value of σ is 0.1. The sample size
required to estimate the error with d = 0.01
and δ = 0.05 is 385.
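These figures can be checked with a short calculation. The sketch is illustrative only. The per-goal error values are derived here by comparing what clauses C1 to C13 entail with the target instances of Table 1 (0, 2/7, 1/7 and 1/7 respectively), and the variable names are assumptions.

```python
import math
from fractions import Fraction
from statistics import NormalDist

# goal clause -> (probability from Table 1, error value |h_g symdiff T_g| / 7)
GOALS = {
    "father(fred, X)": (Fraction(2, 10), Fraction(0, 7)),
    "father(X, jan)":  (Fraction(3, 10), Fraction(2, 7)),
    "mother(X, fred)": (Fraction(3, 10), Fraction(1, 7)),
    "sister(jan, X)":  (Fraction(2, 10), Fraction(1, 7)),
}

# Mean error probability: expected error value across goal clauses.
mu = sum(p * e for p, e in GOALS.values())

# Variance as E(X^2) - mu^2, then the standard deviation.
variance = sum(p * e * e for p, e in GOALS.values()) - mu ** 2
sigma = math.sqrt(variance)

# Sample size for d = 0.01 and delta = 0.05.
z = NormalDist().inv_cdf(1 - 0.05 / 2)
n_required = math.ceil((z * sigma / 0.01) ** 2)
```

Working in exact fractions gives μ = 11/70 (about 0.1571) and a variance of exactly 1/100, so σ = 0.1 and the required sample size rounds up to 385.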


A typical execution of the ConfThresh algorithm
would give the probabilities in Table 3
and Table 4.

g instance               Refutation Set      TP       FP
father(fred, ann)        C1, C9, C12         0.0285   0
father(fred, william)    C1, C10, C12        0.0285   0
father(patricia, jan)    C2, C6, C13         0        0.0428
mother(patricia, fred)   C4, C8              0.0428   0
mother(graeme, fred)     C4, C11             0        0.0428
sister(jan, grace)       C3, C6, C7, C13     0.0285   0
sister(jan, fred)        C3, C6, C8, C13     0.0285   0
sister(jan, jan)         C3, C6, C13         0        0.0285

Table 3: Refutation Sets


Clause   TP       FP
C1       0.057    0
C2       0        0.0428
C3       0.057    0.0285
C4       0.0428   0.0428
C5       0.0285   0
C6       0.057    0.0713
C7       0.0285   0
C8       0.0713   0
C9       0.0285   0
C10      0.0285   0
C11      0        0.0428
C12      0.057    0
C13      0.057    0.0713

Table 4: Clause Probabilities


The FN-probability for the hypothesis h is
0.0428 and the FP-probability is 0.1141. The
total error is 0.1569. There are four candidates
for deletion: C2, C6, C11 and C13.
Each candidate has an FP-probability greater
than its TP-probability. This means that if
such a clause were deleted, the increase in
FN from TP would be smaller than the decrease in
FP, so the error probability decreases.

Assume that C2 is chosen for deletion. It
directly decreases FP with no effect on TP.
FN remains the same but FP for the new
hypothesis h_2 is reduced by the FP-probability
of C2. The error probability for h_2 is
0.0428 + 0.0713 = 0.1141.

There are two clauses affected by this deletion.
Both C6 and C13 have their FP-probabilities
reduced to 0.0285.

The second iteration has only one candidate
for deletion: C11. It also has a TP-probability
of 0 so the error for h_3 becomes 0.0428 +
0.0285 = 0.0713.

No further clauses can be deleted. The
hypothesis h_3 = h - {C2, C11} is a local optimum.

8  Conclusion 

This  paper  has  several  results: 

1. It introduces multisource learning and
shows that it can be used as a framework
for multistrategy learning.

2. It develops a PAC characterization for
function-free definite clauses with
respect to a population of goal clauses.

3. It introduces PAC-Error identification as
a reformulation of PAC-learning more
suitable for a multisource learning
model.

4. It calculates appropriate sample sizes
for obtaining confidence in hypotheses.
Formulas are based on the central limit
theorem from statistics.


5.  It  shows  what  probability  regions  exist 
across  a  universe  of  instances  and  how 
these  regions  respond  to  a  learning 
algorithm. 

Results using this model have been promising
but are in the preliminary stages. It must
be determined how successfully this model
scales to a larger domain theory. Exploring
various combinations of the number and
types of learning elements would determine
the generality of this model.

Abstract

This paper presents a major revision of the
Either propositional theory refinement
system. Two issues are discussed. First, we
show how run time efficiency can be greatly
improved by changing from an exhaustive
scheme for computing repairs to an iterative
greedy method. Second, we show how
to extend Either to refine M-of-N rules.

The resulting algorithm, Neither (New
Either), is more than an order of magnitude
faster and produces significantly more
accurate results with theories that fit the
M-of-N format. To demonstrate the advantages
of Neither, we present preliminary
experimental results comparing it to
Either and various other systems on refining
the DNA promoter domain theory.

1  Introduction 

Recently, a number of machine learning systems
have been developed that use examples to revise an
approximate (incomplete and/or incorrect) domain
theory [Ginsberg, 1990; Ourston and Mooney, 1990;
Towell and Shavlik, 1991; Danyluk, 1991; Whitehall
et al., 1991; Matwin and Plante, 1991]. Most
of these systems revise theories composed of strict
if-then rules (Horn clauses). However, many concepts
are best represented using some form of partial
matching or evidence summing, such as M-of-N
concepts, which are true if at least M of a set
of N specified features are present in an example.
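
The definition above can be made concrete with a minimal Python sketch. The class name `MofNRule` and its fields are illustrative assumptions, not part of Either or Neither; the sketch only shows that an M-of-N rule fires when at least M of its N antecedents hold, and that a strict Horn clause corresponds to the special case M = N.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MofNRule:
    """Hypothetical representation of an M-of-N rule (names are illustrative)."""
    consequent: str
    threshold: int                 # M: how many antecedents must hold
    antecedents: Tuple[str, ...]   # the N specified features

    def fires(self, true_features: FrozenSet[str]) -> bool:
        # Count the antecedents present in the example and compare with M.
        satisfied = sum(1 for a in self.antecedents if a in true_features)
        return satisfied >= self.threshold


# A 2-of-3 rule: true if at least two of b, c, d are present.
two_of_three = MofNRule("a", 2, ("b", "c", "d"))
# A strict Horn clause is the special case M = N.
horn_clause = MofNRule("a", 3, ("b", "c", "d"))

example = frozenset({"b", "d"})
partial_match = two_of_three.fires(example)   # two antecedents hold
strict_match = horn_clause.fires(example)     # c is missing
```

Under these assumptions, the same example satisfies the 2-of-3 rule but not the corresponding Horn clause, which is the kind of partial matching the text refers to.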

* Supported by the NASA Graduate Student Researchers
Program under grant number NGT-50732, the
National Science Foundation under grant IRI-9102926,
and a grant from the Texas Advanced Research Program
under grant 003658144. This paper was originally
published in the proceedings of the Thirteenth International
Joint Conference on Artificial Intelligence.


There has been some work on the induction of M-of-N
rules that demonstrates the advantages of this
representation [Spackman, 1988; Murphy and Pazzani,
1991]. Other work has focused on revising rules that
have real-valued weights [Towell and Shavlik, 1992;
Mahoney and Mooney, 1992]. However, revising theories
with simple M-of-N rules has not previously
been addressed. Since M-of-N rules are more constrained
than rules with real-valued weights, they
provide a stronger bias and are easier to comprehend.

This paper presents a major revision of the
Either propositional theory refinement system
[Ourston and Mooney, 1990; Ourston and Mooney,
in press] that is significantly more efficient and is
also capable of revising theories with M-of-N rules.
Either is inefficient because it computes a potentially
exponential number of repairs for each failing
example. The new version, Neither (New
Either), computes only the single best repair for
each example, and is therefore much more efficient.

Also, because it was restricted to strict Horn-clause
theories, Either did not produce as accurate
results as Kbann (a neural-network revision system)
on the DNA promoter problem [Towell and Shavlik,
1991; Towell and Shavlik, 1992]. Some aspects of the
promoter concept fit the M-of-N format, since there
are several potential sites where hydrogen bonds can
form between the DNA and a protein; if enough
of these bonds form, promoter activity can occur.
Either attempts to learn this concept by forming
a separate rule for each potential configuration by
deleting different combinations of antecedents from
the initial rules. Since a combinatoric number of
such rules is needed to accurately model an M-of-N
concept, the generality of the resulting theory is
impaired. Neither, however, includes the ability to
generalize a rule by lowering the threshold on an
M-of-N rule. Including threshold changes as an alternative
method for covering misclassified examples was
easily incorporated within the basic Either framework.

To demonstrate the advantages of Neither, we
present experimental results comparing it to Either
and various other systems on refining the promoter
domain theory. Neither runs more than an order of
magnitude faster than Either and produces a significantly
more accurate theory with minor revisions
that are easy to understand.

2  Theory  Revision  Algorithm 

2.1 The Either Algorithm

The original Either theory refinement algorithm
has been presented in various levels of detail in
[Ourston and Mooney, 1990; Ourston and Mooney,
in press; Ourston, 1991]. It was designed to repair
propositional Horn-clause theories that are either
overly-general or overly-specific or both. An
overly-general theory is one that causes an example
(called a failing negative) to be classified in categories
other than its own. Either specializes existing
antecedents, adds new antecedents, and retracts
rules to fix these problems. An overly-specific theory
causes an example (called a failing positive) not
to be classified in its own category. Either retracts
and generalizes existing antecedents and learns new
rules to fix these problems. Unlike other theory revision
systems that perform hill-climbing (and are
therefore subject to local maxima), Either is guaranteed
to fix any arbitrarily incorrect propositional
Horn-clause theory [Ourston, 1991].

Either Main Loop

    Compute all repairs for each example
    While some examples remain uncovered
        Add best repair to cover set
        Remove examples covered by repair
    end
    Apply repairs in cover set to theory

Neither Main Loop

    While some examples remain
        Compute a single repair for each example
        Apply best repair to theory
        Remove examples fixed by repair
    end

Figure 1: Comparison of Either and Neither algorithms.

The algorithm used by Either for both generalization
and specialization is shown in the top half of
Figure 1. There are three basic steps. First, all possible
repairs for each failing example are computed.
Next, Either enters a loop to compute a subset of
these repairs that can be applied to the theory to
fix all of the failing examples. This subset is called
a cover. Repairs are ranked according to a benefit-to-cost
ratio that trades off the number of examples
covered against the size of the repair and the number
of new failing examples it creates. The best repair
is added to the cover on each iteration. Lastly, the
repairs in the cover are applied to the theory. If the
application of a repair over-compensates by creating
new failing examples, Either passes the covered
examples and the new failing examples to an induction
component.^1 The results of the induction are added
as a new rule when generalizing or as additional
antecedents when specializing.

The time consuming part of this algorithm is the
first step where all repairs for a given failing example
are found. Figure 2 illustrates this process for theory
generalization where Either is searching for leaf-rule^2
antecedent retractions to correct failing positive
examples. The upper half of the diagram shows
an input theory both as rules (on the left) and as
an AND-OR graph. The lower half of the diagram
shows a hypothetical failing positive example and
its partial proofs.^3 From these proofs there are four
possible repairs which will fix the example: retract
h,j,n; retract h,j,o,p; retract k,n; retract k,o,p.
Theory specialization follows a similar process to
return sets of leaf-rule retractions which fix individual
failing negative examples.

^1 Either uses a version of Id3 [Quinlan, 1986] for its
induction.

^2 A leaf rule is a rule whose antecedents include an
observable or an intermediate concept that is not the
consequent of any existing rule.

[Figure 2: ... example. Unprovable antecedents are shown with dotted lines.]

Generalization
change         resulting rule       b   c   bc
orig. rule     a <- 2 of (b,c)      N   N   Y
threshold -1   a <- 1 of (b,c)      Y   Y   Y
delete b       a <- c               N   Y   Y

Specialization
change         resulting rule       b   c   bc
orig. rule     a <- 1 of (b,c)      Y   Y   Y
threshold +1   a <- 2 of (b,c)      N   N   Y
delete rule    none                 N   N   N

Table 1: Comparison of Revisions.

2.2 Speeding Up Either

We have recently implemented a new version of
Either (Neither) that takes a different approach,
as shown in the bottom half of Figure 1. Two
new algorithms form the basis for the difference
between Either and Neither. First, calculation
of repairs is now achieved in linear time. Second,
all searches through the theory (for deduction,
antecedent retraction and rule retraction) are optimized
in Neither to operate in linear time by
marking the theory to avoid redundant subproofs.
Neither abandons the notion of searching for all
partial proofs in favor of a greedy approach which
rapidly selects a single best repair for each example.
The three steps of the old Either algorithm can
then be integrated into a single loop (see Figure 1).
To illustrate how repairs are computed in linear
time, refer again to Figure 2. Rather than computing
all partial proofs, Neither works bottom-up,
constructing a single set of retractions. When multiple
options exist, Neither alternates between returning
the smallest option and returning the union
of the options, depending on whether the choice
involves an AND or OR node. For generalization,
retractions are unioned at AND nodes because all
unprovable antecedents must be removed to make
the rule provable. At OR nodes, only the smallest
set of retractions is kept since only one rule need
be provable. For specialization, these choices are
reversed. Results are unioned at OR nodes to disable
all rules which fire for a faulty concept. At AND
nodes, the smallest set of rule retractions is selected
since any single failure will disable a rule.
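
The generalization case described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Neither's actual implementation: the theory is assumed to be an acyclic dictionary mapping each concept to a non-empty list of Horn-clause rules, the function name `generalization_repair` and the small theory below are hypothetical (only loosely modeled on the description of Figure 2), and memoization stands in for "marking the theory" so that each concept is processed once.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Theory = Dict[str, List[List[str]]]   # concept -> rules (lists of antecedents)
Retraction = Tuple[str, int, str]     # (consequent, rule index, antecedent)


def generalization_repair(theory: Theory, example: Set[str],
                          goal: str) -> FrozenSet[Retraction]:
    """Greedy bottom-up choice of antecedent retractions that make `goal` provable.

    Assumes an acyclic theory in which every key has at least one rule.
    """
    memo: Dict[str, FrozenSet[Retraction]] = {}   # avoids redundant subproofs

    def repair(concept: str) -> FrozenSet[Retraction]:
        if concept in memo:
            return memo[concept]
        options = []
        for index, rule in enumerate(theory[concept]):
            # AND node: every unprovable antecedent must be dealt with,
            # so the retractions of all antecedents are unioned.
            needed: Set[Retraction] = set()
            for antecedent in rule:
                if antecedent in theory:
                    needed |= repair(antecedent)
                elif antecedent not in example:
                    # Unprovable leaf antecedent: retract it from this rule.
                    needed.add((concept, index, antecedent))
            options.append(frozenset(needed))
        # OR node: only one rule need be provable, so keep the smallest set.
        best = min(options, key=len)
        memo[concept] = best
        return best

    return repair(goal)


# Hypothetical theory: a <- b,c;  b <- d;  b <- e;  d <- g,h,j;  e <- g,k;
# c <- n;  c <- o,p.  In the failing positive example only g is true.
theory = {
    "a": [["b", "c"]],
    "b": [["d"], ["e"]],
    "d": [["g", "h", "j"]],
    "e": [["g", "k"]],
    "c": [["n"], ["o", "p"]],
}
repair_set = generalization_repair(theory, {"g"}, "a")
retracted = sorted(antecedent for _, _, antecedent in repair_set)
```

With this hypothetical theory the sketch selects the retraction of k for node b (one retraction instead of two) and of n for node c, and unions the two at the root. The specialization case would swap the roles of union and minimum.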

As an example, in Figure 2 the antecedent retraction
calculations for the example would begin at the
root of the graph, recursively calling nodes b and
c. Retraction for node b then recurses on nodes d
and e. When the recursion returns back to node
b a choice must be made between the results from
nodes d and e because node b is an OR node. Since
the latter requires fewer retractions, it is chosen as
the return value for node b. This process continues,
resulting in a final repair: retract k,n.

^3 A partial proof is one in which some antecedents
cannot be satisfied.

Note that this algorithm is linear in the size of the
theory. No node is visited more than once, and the
computation for choosing among potential retractions
must traverse the length of each rule at most
once. The final repair is also minimum with respect
to the various choices made along the way; it is not
possible to find a smaller repair that will satisfy the
example. This new algorithm thus trades the complete
information available in the partial proofs for
speed in computation.

2.3 Adding M-of-N Rules to Neither

With M-of-N rules, there are six types of revisions
that can be made to a theory. As before, antecedents
may be deleted or rules may be added to generalize
the theory, and antecedents may be added or rules
deleted to specialize the theory. The two new revisions
are to increase or decrease the threshold: decreasing
generalizes a rule and increasing specializes
it.

To incorporate these two new revisions, Neither
must be changed in four places. First, the computation
of a repair for each failing example must
take thresholds into account. For generalization, one
need only retract enough antecedents to make the
rule provable; there is no need to retract all false
antecedents if the rule has a threshold. For example,
if the rule for e in Figure 2 had a threshold of
1 there would be no need to retract k to prove this
rule. A similar accounting for thresholds is required
for computing rule deletions for specialization. Note
that during generalization the threshold of each rule
from which antecedents are retracted must be decreased
by the number of antecedents retracted to
account for the smaller size of the rule.

Second, Neither must compute threshold repairs.
Calculating threshold changes can be done
in conjunction with the computation of antecedent
and rule deletion repairs since it is directly related
to how many antecedents of a rule are provable.
For generalization, we change the threshold to the
number of antecedents which are provable. In specialization,
we set the threshold to one more than
the number of provable antecedents.
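
As a minimal sketch of the two threshold repairs just described (the function names are illustrative assumptions, and antecedents are treated as directly observable features for simplicity):

```python
from typing import Iterable, Set


def provable_count(antecedents: Iterable[str], true_features: Set[str]) -> int:
    """Number of antecedents of a rule that hold for the example."""
    return sum(1 for a in antecedents if a in true_features)


def generalized_threshold(antecedents: Iterable[str], true_features: Set[str]) -> int:
    # Failing positive: lower the threshold to the number of provable
    # antecedents, so that the rule now fires for this example.
    return provable_count(antecedents, true_features)


def specialized_threshold(antecedents: Iterable[str], true_features: Set[str]) -> int:
    # Failing negative: raise the threshold to one more than the number of
    # provable antecedents, so that the rule no longer fires for this example.
    return provable_count(antecedents, true_features) + 1


# a <- 2 of (b,c) fails to cover a positive example in which only b is true.
new_m_general = generalized_threshold(("b", "c"), {"b"})
# a <- 1 of (b,c) wrongly covers a negative example in which only b is true.
new_m_special = specialized_threshold(("b", "c"), {"b"})
```

In this sketch the generalized rule becomes a <- 1 of (b,c) and the specialized rule becomes a <- 2 of (b,c), matching the threshold rows of Table 1.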

Third, a mechanism must be provided for selecting
between a threshold change and a deletion. Effectively,
this amounts to deciding which type of revision
to try first. The philosophy used in Neither
is to try the most aggressive changes initially in the
hopes that the resulting repair will cover more examples.
If the repair creates new failing examples,
the less ambitious repairs are tried in turn with induction
used as a last resort. During generalization,
more radical repairs are those which create more
general rules (i.e., rules which can prove more examples).
In specialization, the opposite is true. As
with Either, if all changes result in new failing examples,
the algorithm falls back to induction to learn
new rules or add new antecedents.

Table 1 compares equivalent threshold and deletion
changes for generalization and specialization.
The columns labeled with b, c and bc indicate
whether the corresponding rule will conclude a when
just b, just c or both b and c are true. Note that in
both cases, the threshold change results in a more
general rule. This means that threshold changes
should be tried before antecedent deletions during
generalization, but tried after rule deletions during
specialization.

Fourth and finally, the induction component of
Neither must be altered slightly to accommodate
threshold rules. When the application of a repair
causes new failing examples to occur, Neither resorts
to induction as did Either. The result of the
induction cannot, however, simply be added to the
theory as before. Table 2 illustrates the problem.
The original rule shown can be used to prove both
the positive and negative examples, and deleting this
rule or incrementing its threshold only prevents the
positive example from being proved. Assume that
induction returns a new feature, d, which can be
used to distinguish the two examples (i.e., d is true
for the positive example but false for the negative
example). Because the original rule has a threshold,
adding d directly will still allow both examples
to prove the rule. This problem remains even if
one tries to increment the threshold in addition to
adding d. Instead, the rule must be split by renaming
the consequent of the original rule, and creating
a new rule with the renamed consequent and the results
of induction as the new rule's antecedent list.

example features     pos. example   neg. example
b, c, d              b, ¬c, d       b, c, ¬d

orig. rule           pos. example   neg. example
a <- 1 of (b,c)      Y              Y

add to rule          pos. example   neg. example
a <- 1 of (b,c,d)    Y              Y

split rule           pos. example   neg. example
X <- 1 of (b,c)      Y              Y
a <- X, d            Y              N

Table 2: Induced Antecedent Addition.
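
The need for splitting can be checked with a small Python sketch. The helper `m_of_n`, the function `split_rule`, and the intermediate concept name `X` are illustrative assumptions; the two examples are those described for Table 2 (positive: b and d true, c false; negative: b and c true, d false).

```python
def m_of_n(threshold, antecedents, facts):
    """True if at least `threshold` of the antecedents are in `facts`."""
    return sum(1 for a in antecedents if a in facts) >= threshold


pos = {"b", "d"}   # b, not c, d
neg = {"b", "c"}   # b, c, not d

# Original rule a <- 1 of (b,c): proves both examples.
original = [m_of_n(1, ("b", "c"), ex) for ex in (pos, neg)]

# Adding d to the antecedent list: a <- 1 of (b,c,d) still proves both.
added = [m_of_n(1, ("b", "c", "d"), ex) for ex in (pos, neg)]

# Adding d and incrementing the threshold: a <- 2 of (b,c,d) still proves both.
added_and_raised = [m_of_n(2, ("b", "c", "d"), ex) for ex in (pos, neg)]


def split_rule(facts):
    """X <- 1 of (b,c);  a <- X, d  (the renamed consequent plus the induced feature)."""
    derived = set(facts)
    if m_of_n(1, ("b", "c"), derived):
        derived.add("X")
    return "X" in derived and "d" in derived


split = [split_rule(ex) for ex in (pos, neg)]
```

Only the split version separates the two examples: the renamed rule keeps its partial-matching behavior, while the induced feature d acts as a strict additional condition.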

3  Experimental  Results 

3.1  Experimental  Design 

For the purposes of this paper, the resulting algorithm
is labeled Neither-MofN. We tested both
Neither and Neither-MofN against other classification
algorithms using the DNA promoter sequences
data set [Towell et al., 1990]. This data
set involves 57 features, 106 examples, and 2 categories.
The theory provided with the data set has
an initial classification accuracy of 50%. We selected
this particular data set because Either performed
poorly on data sets best modelled using M-of-N
rules. In addition to testing Either, Neither
and Neither-MofN, we ran experiments using
Id3 [Quinlan, 1986], backpropagation [Rumelhart et
al., 1986] and Rapture [Mahoney and Mooney, in
press] (a revision system based on certainty factors).

The experiments proceeded as follows. Each data
set was divided into training and test sets. Training
sets were further divided into subsets, so that the algorithms
could be evaluated with varying amounts
of training data. After training, each system's accuracy
was recorded on the test set. To reduce statistical
fluctuations, the results of this process of dividing
the examples, training, and testing were averaged
over 25 runs. The random seeds for the backpropagation
algorithm were reset for each run. Training
time and test set accuracy were recorded for each
run. Statistical significance was measured using a
Student t-test for paired difference of means at the
0.05 level of confidence (i.e., 95% certainty that the
differences are not due to random chance).
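
For readers unfamiliar with the test, the paired t statistic can be sketched in Python using only the standard library. The function name is an assumption, and the accuracy values below are made-up illustrative numbers for five runs, not results from the experiments reported here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for the paired difference of means."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))   # mean over its standard error
    return t, n - 1


# Illustrative per-run test-set accuracies for two systems on the same splits.
system_a = [0.90, 0.88, 0.92, 0.91, 0.89]
system_b = [0.85, 0.86, 0.88, 0.87, 0.84]

t_value, dof = paired_t_statistic(system_a, system_b)

# Two-tailed critical value of Student's t at the 0.05 level for 4 degrees
# of freedom, taken from standard tables.
T_CRITICAL_DF4 = 2.776
significant = abs(t_value) > T_CRITICAL_DF4
```

Pairing matters because both systems are trained and tested on the same random splits in each run, so the test is applied to the per-run differences rather than to the two sets of accuracies independently.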

3.2  Results 

The results of our experiments are shown in the three
graphs of Figures 3, 4 and 5. Figure 3 compares
the learning curves of the systems tested, showing
how predictive accuracy on the test set changes
as a function of the number of training examples.
As can be seen, Neither-MofN's performance was
significantly better than all other systems except
Rapture and Kbann.^4 Rapture out-performed
Neither-MofN with small numbers of training
examples but their accuracy was comparable with
larger inputs. Neither's accuracy was on par with
backpropagation, but was lower than Either for
small training sets and higher than Either for large
training sets. Note that Figure 3 is not a direct comparison
of Neither and Kbann since the results
reported were compiled from different subsets of the
DNA promoter sequences data set. Id3 had significantly
lower accuracy than the other systems.

Figure 3: Test Set Accuracy

Figure 4 shows a comparison of training times.
Both Neither-MofN and Neither were more
than an order of magnitude faster than backpropagation
and Either. Only Id3 ran faster than
Neither-MofN.

We  also  collected  data  on  the  average  complexity 
of  the  revised  theories  produced  by  both  NEITHER 
and  Neither-MofN.  Complexity  was  measured  as 
the  total  size;  i.e.,  the  total  number  of  all  literals 
in  the  theory.  The  results  are  shown  in  Figure  5. 
As  can  be  seen  from  this  graph,  Neither-MofN 
not  only  produces  less  complex  resulting  theories  but 
also  produces  theories  closer  in  size  to  the  original. 


Figure  4:  Training  Time  Comparison 




^4 Technically, the last difference between backpropagation
and Neither-MofN was only significant at the
0.1 level.


Figure  5:  Concept  Complexity 


3.3  Discussion 

Many of our expectations were borne out by
the experimental results. Both Neither and
Neither-MofN ran more than an order of magnitude
faster than Either due to the optimized algorithms
discussed in section 2. Neither-MofN's
increase in accuracy was also expected since the
new algorithm is able to concentrate on making
M-of-N revisions directly. Also, the fact that
Neither-MofN generates less complex theories
is not surprising, again because it can directly
modify threshold values rather than create new
rules. In short, by adding one more operator
to the generalization and specialization processes,
Neither-MofN is able to accurately revise a theory
known to be difficult for symbolic systems, without
having to sacrifice the efficiency of a symbolic
approach. Finally, the most comparable learning-curve
results from [Towell, 1991] would indicate that
Kbann's accuracy in the promoter domain is about
the same as Neither-MofN's.

4  Related  Work 

Several researchers have developed methods for inducing
M-of-N concepts from scratch. CRLS [Spackman,
1988] learns M-of-N rules and out-performed
standard rule induction in several medical domains.
ID-2-of-3 [Murphy and Pazzani, 1991] incorporates
M-of-N tests in decision-tree learning and out-performed
standard decision-tree induction in a
number of domains. Both projects clearly demonstrate
the advantages of M-of-N rules.

Seek2 [Ginsberg et al., 1988] includes operators
for refining M-of-N rules; however, its revision process
is heuristic and it is not guaranteed to produce
a revised theory that is consistent with all of the
training examples. Neither uses a greedy covering
approach to guarantee that it finds a set of revisions
that fix all of the misclassified examples in the training
set. Also, unlike Neither, Seek2 cannot learn
new rules or add new antecedents to existing rules.

Kbann [Towell and Shavlik, 1992] revises a theory
by translating it into a neural network, using
backpropagation to refine the weights, and then
retranslating the result back into symbolic rules.
Neither's symbolic revision process is much more
direct and, from all indications, significantly faster.
Although Kbann's results are referred to as M-of-N
rules, they actually contain real-valued antecedent
weights and therefore are not strictly M-of-N. In addition,
Kbann's revised theories for the promoter
problem are also more complex in terms of number
of antecedents than the initial theory [Towell, 1991],
while Neither actually produces a slight reduction.
Therefore, Neither's revised theories are less complex
and presumably easier to understand. Finally,
unlike Kbann, Neither is guaranteed to converge
to 100% accuracy on the training data.

Rapture [Mahoney and Mooney, 1992] uses a
combination of symbolic and neural-network learning
methods to revise a certainty-factor rulebase
[Buchanan and Shortliffe, 1984]. Consequently,
it lies somewhere between Neither and Kbann
on the symbolic-connectionist dimension. As illustrated
in the results, its accuracy on the promoter
problem is only slightly superior to Neither's.
However, its real-valued certainty factors make its
rules more complex.

5  Future  Work 

The current version of Neither needs to be enhanced
to handle a number of issues. We need
to incorporate a number of advanced features from
Either, such as constructive induction, modification
of higher-level rules, and the ability to handle
numerical features and noisy data. Also, we could
extend our methods to handle negation as failure
and incorporate the ability to handle M-of-N
rules into first-order theory revision [Richards and
Mooney, 1991]. Finally, we need to perform a more
comprehensive experimental evaluation of the system.

6  Conclusions 

This paper has presented an efficient propositional
theory refinement system that is capable of revising
M-of-N rules. The basic framework is a modification
of Either [Ourston and Mooney, 1990]; however,
the construction of partial proofs has been reduced
from exponential to linear time and a method for
revising the thresholds of M-of-N rules has been incorporated.
The resulting system runs more than an
order of magnitude faster and produces significantly
more accurate results in domains requiring partial
matching, such as the problem of recognizing promoters
in DNA.
Abstract

This paper presents an integrated heuristic approach
to knowledge base refinement which
is viewed as a supervised validation of
plausible reasoning. The approach integrates
multistrategy learning based on multitype
inference, active experimentation, and guided
knowledge elicitation. One of the main
features of this approach is that once the
knowledge base has been refined to deductively
entail a new piece of knowledge, it can
be easily further refined to deductively entail
many other similar pieces of knowledge.

Keywords: multistrategy learning, knowledge
acquisition, plausible reasoning


1.  Introduction 

An  expert  system  consisting  of  an  incomplete 
and  partially  incorrect  knowledge  base  (KB), 
and  of  a  deductive  inference  engine,  suffers 
from  two  major  limitations: 

•  it  is  not  able  to  solve  some  problems  from 
its  domain  of  expertise  (because  the  KB  is 
incomplete); 

•  the  solutions  proposed  might  be  incorrect 
(because  the  KB  is  partially  incorrect). 

The set of problems which such a system
could solve is the deductive closure of the
knowledge base (DC). In the case of a
theorem prover, it is the set of facts which
could be deductively inferred from the KB.

That is, $DC = \{ I : KB \models I \}$

where "$\models$" means deductive entailment.
Figure 1 shows the relationship between the
deductive closure DC of the imperfect KB of
an expert system and the set of true facts in
the application domain (TC):

• DC and TC are "crisp" sets, with cleanly
defined borders. That is, the system has an
algorithm for testing the membership of a
statement in DC, and presumably a human
expert can perform a similar test on TC.

• DC ∩ TC represents the set of facts which
are deductively entailed by the KB and are
true. This shows that there is useful and
correct knowledge encoded into the facts
and the deductive rules of the KB.

• DC - TC represents the set of facts which
are deductively entailed by the KB but are
false. This shows that there are errors in the
set of facts and deductive rules.

• TC - DC represents the set of facts which
are true but are not deductively entailed by
the KB. This shows that the set of facts and
deductive rules is incomplete.
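
The three regions listed above map directly onto set operations. The following Python sketch uses made-up fact names purely for illustration; in practice neither DC nor TC would be available as an explicit finite set.

```python
# Hypothetical facts: f1..f3 are entailed by the KB, f2..f4 are true in the domain.
DC = {"f1", "f2", "f3"}   # deductive closure of the imperfect KB
TC = {"f2", "f3", "f4"}   # true facts in the application domain

entailed_and_true = DC & TC    # correct knowledge encoded in the KB
entailed_but_false = DC - TC   # evidence of errors in facts or rules
true_but_missing = TC - DC     # evidence that the KB is incomplete

# A perfect KB would leave both difference sets empty.
kb_is_perfect = not entailed_but_false and not true_but_missing
```

Refinement, in these terms, aims to shrink both difference sets so that DC approaches TC.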


Figure  1:  The  relationship  between 
DC  and  TC. 

The goal of KB refinement is to improve the
knowledge base so that DC becomes a good
approximation of TC. As a result, the KB
would become an almost complete and
correct one, and the expert system would be
able to correctly solve most of the problems
from its domain of expertise.

Many of the current KB refinement systems
such as ANA-EBL (Cohen, 1991), CLINT
(De Raedt and Bruynooghe, 1993), DUCTOR
(Cain, 1991), EITHER (Mooney and Ourston,
1993), FORTE (Richards and Mooney, 1993),
SEEK (Ginsberg, Weiss, and Politakis, 1988),
try to partially generalize the KB so as to
cover more of TC, and to partially specialize
it, so as to cover less of DC - TC. In the case of
a Prolog-like KB, this is accomplished by
generalizing and/or specializing some of the
rules, as well as by introducing new facts into
the KB and/or removing other facts.

In  this  paper,  we  are  also  addressing  the 
problem  of  correcting  and  extending  DC  so  as 
to  better  approximate  TC.  However,  our 
approach,  as  opposed  to  the  approaches  cited 
above,  brings  a  new  set  into  play,  the 
plausible  closure  of  a  KB,  and  proposes  a 
different  perspective  to  the  KB  refinement 
problem. 


2. The plausible closure of the KB

We are assuming that the initial incomplete
and partially incorrect KB consists of facts
and rules expressed in first-order logic.
However, the rules are not restricted to be
deductive. They might also be weaker
correlations such as determinations (Davies and
Russell, 1987; Russell, 1989), mutual
dependencies (Michalski, 1993), etc. This is
so for allowing the introduction of all sorts of
relevant knowledge into the KB.

The plausible closure of the KB (PC) is defined
as the set of problems which a plausible
inference engine could solve. In the case of a
theorem prover, it is the set of facts which
could be plausibly inferred from the KB.

That is, $PC = \{ I : KB \mathrel{|\!\approx} I \}$

where "$\mathrel{|\!\approx}$" means plausible entailment.

One way to make plausible inferences is to
use the rules from the KB not only
deductively, but also abductively or
analogically.

Let us consider, for instance, the rule
$\forall x\,[P(x) \rightarrow Q(x)]$.

If one knows that P(a) is true, then one may
deductively infer Q(a):

$\{P(a),\ \forall x\,[P(x) \rightarrow Q(x)]\} \models Q(a)$

If one knows that Q(a) is true, then one may
abductively (Pople, 1973; Josephson, 1991)
infer P(a):

$\{Q(a),\ \forall x\,[P(x) \rightarrow Q(x)]\} \mathrel{|\!\approx} P(a)$

If one knows that P(a) is true, and b is similar
to a, then one may analogically (Carbonell,
1986; Gentner, 1990; Kedar-Cabelli, 1988;
Kodratoff, 1990; Winston, 1980) infer Q(b):

$\{P(a),\ \forall x\,[P(x) \rightarrow Q(x)],\ (b \text{ `similar' to } a)\} \mathrel{|\!\approx} Q(b)$

Another way to make plausible inferences is
to use weaker correlations between
knowledge pieces (e.g. related facts,
determinations, dependencies, "A is like B"
statements, etc.).

Let us consider, for instance, that the KB
contains the following related facts (each set
describing an object):

$P(c) \wedge Q(c) \wedge R(c)$

$P(d) \wedge Q(d) \wedge S(d)$

$P(e) \wedge Q(e)$

Then one might empirically generalize
(Mitchell, 1978; Michalski, 1983; Quinlan,
1986) these sets of facts to the rule

$\forall x\,[P(x) \rightarrow Q(x)]$

and might deductively use this rule with the
fact P(a) to predict that Q(a) is also true.

Analogical  inferences  could  be  made  by  em¬ 
ploying  plausible  determinations  (Russell, 
1989;  Tecuci,  1993).  Let  us  consider,  for 
instance,  the  following  determination  rule 
stating  that  U  plausibly  determines  V: 

$U(x, y) \succ V(x, z)$

Then  one  may  make  the  following  analogical 
inference: 

$U(s, a) \wedge V(s, b) \wedge U(t, a) \;\mid\approx\; V(t, b)$
Another example of plausible inference is the
“useful analogical inference”, introduced by
Greiner (1988). Let us suppose, for instance,
that the system is told that Q(b) is true, and is
given the analogical hint “b is like a”, in
order to show that $KB \models Q(b)$. That is,
without this analogical hint, the system is not
able to show that $KB \models Q(b)$. Based on this
analogical hint, the system looks for a
feature of a (e.g. P(a)) which, if possessed by
b, would allow it to prove $KB \models Q(b)$:

$KB \models P(a)$

$KB \not\models P(b)$

$KB \not\models \neg P(b)$

$\{P(b), KB\} \models Q(b)$

As  a  result  of  this  reasoning  P(b)  is  asserted 
into  the  KB. 
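
The four conditions above can be checked mechanically on a small ground KB. The following is only a minimal sketch under stated assumptions: facts are tuples, rules are ground Horn clauses, entailment is approximated by forward chaining, and the names `closure` and `useful_hypotheses` are hypothetical helpers introduced for illustration, not part of the described system.

```python
def closure(facts, rules):
    """Forward-chain ground Horn rules (body, head) to a fixed point."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return known


def useful_hypotheses(facts, rules, a, b, goal, predicates):
    """Return every P(b) such that P(a) is entailed, neither P(b) nor
    not-P(b) is entailed, and adding P(b) makes the goal derivable."""
    base = closure(facts, rules)
    found = []
    for p in predicates:
        pa, pb, not_pb = (p, a), (p, b), ("not", p, b)
        if (pa in base and pb not in base and not_pb not in base
                and goal in closure(set(facts) | {pb}, rules)):
            found.append(pb)
    return found


# Assumed toy KB: a has features P and R; b is known not to have R.
facts = {("P", "a"), ("R", "a"), ("not", "R", "b")}
rules = [([("P", "b")], ("Q", "b")), ([("R", "b")], ("Q", "b"))]
hypotheses = useful_hypotheses(facts, rules, "a", "b", ("Q", "b"), ["P", "R"])
print(hypotheses)
```

In this toy case only P(b) qualifies, since R(b) is contradicted by the KB; P(b) is the feature that would be asserted.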

Several  other  types  of  plausible  derivations 
based  on  implications  and  dependencies  are 
described  in  (Collins  and  Michalski,  1989). 

In order to show that a certain fact, I, is
plausibly entailed by the KB, a system is not
restricted to making only one plausible
inference. In general, it could build a plausible
justification tree (Tecuci, 1993). A plausible
justification  tree  is  like  a  proof  tree,  except 
that  the  inferences  which  compose  it  may  be 
the  result  of  different  types  of  reasoning  (not 
only  deductive,  but  also  analogical,  abductive, 
predictive,  etc.).  An  example  of  such  a  tree  is 
presented  in  Figure  5. 

One  of  the  main  reasons  for  illustrating  the 
above  plausible  inferences  was  to  show  that, 
by  employing  a  plausible  inference  engine, 
one could significantly extend the set of
problems  that  could  be  solved  by  a  system. 

Figure  2  presents  our  conjecture  about  the 
relationships  between  the  plausible  closure  of 
the  KB,  the  deductive  closure  of  the  KB,  and 
the  set  of  true  facts  in  the  application  domain: 

• PC is a “soft” set, the boundaries of which
are not strictly defined. Indeed, depending
on the number and the strength of the
different types of plausible reasoning steps
in a justification tree for a fact F, the
plausibility of F is higher or lower.

• $PC \supseteq DC$ because the deductive proof
trees are special cases of plausible
justification trees.

• $PC \cap TC$ represents the set of facts which
are plausibly entailed by the KB and are
true.

• $(PC \cap TC) - DC$ represents the set of true
facts which are plausibly entailed by the
KB, but are not deductively entailed by the
KB. Our hypothesis is that this is a
significantly large set.

• $TC - PC$ represents the set of true facts
that are not plausibly entailed by the KB.
Although this set is not well defined, it
expresses the intuition that there are true
facts which even a plausible inference engine
could not derive from the current KB.


Figure  2:  The  relationship 
between  DC,  PC,  and  TC. 


The deductive and plausible closures are two
approximations of truth. In the approach we
are proposing, we consider DC to be
an approximate lower bound for TC,
and PC to be an approximate upper bound
for TC. With this interpretation, the KB
refinement problem reduces to one of
determining the set TC in the plausible space
defined by DC and PC. More precisely,
during KB refinement, DC will be extended
with a significant portion of $PC \cap TC$, and
will also be corrected to remove from it most
of $DC - TC$. Consequently, as a result of this
process, DC will become a good approximation
of TC.
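
The set relationships described here can be illustrated with ordinary finite sets. This is a toy sketch only: the fact names `f1` … `f9` are invented placeholders, and in practice TC is not known in advance (it is approached through the expert's answers).

```python
# Assumed toy closures; f9 is a false fact that the KB wrongly deduces.
DC = {"f1", "f2", "f9"}                 # deductive closure
PC = DC | {"f3", "f4", "f8"}            # plausible closure, a superset of DC
TC = {"f1", "f2", "f3", "f4", "f5"}     # true facts of the domain

to_add = (PC & TC) - DC      # true, plausibly but not deductively entailed
to_remove = DC - TC          # deductively entailed but false
out_of_reach = TC - PC       # true facts not even plausibly entailed

# Idealized outcome of refinement: DC moves toward TC within PC.
refined_DC = (DC | to_add) - to_remove
print(sorted(refined_DC), sorted(out_of_reach))
```

After this idealized step the refined DC is a subset of TC and differs from it only by the facts that lie outside PC.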

Otherwise stated, we propose an approach to
KB refinement which is viewed as a transfer
of knowledge from the plausible closure to the
deductive closure.

In this paper we are proposing a heuristic
method which is an effective way of
extending DC with a significant portion of
$PC \cap TC$. More precisely, during knowledge
refinement, the sets DC and PC are
transformed as follows:

•  DC  is  extended  by  acquiring  new  facts  or 
rules,  or  by  generalizing  some  of  the  rules; 

•  DC  is  improved  by  specializing  some  of  the 
deductive  rules  which  could  be  partially 
incorrect; 

• PC is extended and/or improved by
acquiring  new  facts  or  rules,  or  improving 
some  of  the  existing  rules. 

The next section contains a general presentation
of the proposed KB refinement method.

3.  General  presentation  of  the  KB 
refinement  method 

The KB of the system is assumed to be
incomplete and partially incorrect. The KB is
improved during training sessions with a
human expert who provides the system with
new input information I. Each such input I is
an example of an answer that the final expert
system should be able to generate, that is, I
should be in the deductive closure of the final
expert system. The goal of KB refinement is
to improve the KB of the system so that it
answers questions as the human expert would.

If an input I is already in the plausible closure
of the KB, then the system will be able to
make a significant transfer of knowledge
from the plausible closure to the deductive
closure. More precisely, it will extend the
deductive closure with new hypotheses, H, so as
to include a generalization Ig of I, that is

$\{H, KB\} \models I_g$

At  the  same  time,  it  will  extend  the  plausible 
closure  of  the  KB,  so  as  to  include  more  of 
TC,  and  might  also  remove  some 
inconsistencies  from  the  deductive  closure, 
reducing  the  size  of  the  set  DC  -  TC. 

If I is not in the plausible closure of the KB,
then it will be simply asserted into the KB.
This has the effect of extending both DC and
PC. Indeed, the presence of I in DC may
make it possible for the system to show that
other facts (e.g. $I_1$, $I_2$) are deductively or
plausibly entailed by the KB:

$KB \not\models I_1$, but $\{I, KB\} \models I_1$

$KB \not\mid\approx I_2$, but $\{I, KB\} \mid\approx I_2$

This  also  shows  that  during  KB  refinement, 
PC may grow to include facts from TC - PC.

The  main  stages  of  the  KB  refinement  process 
are  presented  in  Figure  3.  They  are: 

•  multitype  inference  and  generalization; 

•  experimentation,  verification  and  repair; 

•  goal-driven  knowledge  elicitation. 

In  the  first  stage,  the  system  analyzes  the 
input  in  terms  of  its  current  knowledge  by 
building  a  plausible  justification  tree  which 
demonstrates  that  the  input  is  a  plausible 
consequence  of  the  system's  current 
knowledge. 


Figure  3:  The  main  stages  of  the  KB  refinement  process 
As a result of the analysis of the input via
multitype inferences, new pieces of
knowledge are hypothesized (through
analogy, abduction, inductive generalization
and prediction, etc.), and existing pieces of
knowledge are improved so that the extended
knowledge base entails the input. By
asserting  these  pieces  of  knowledge  into  the 
knowledge  base,  the  system  is  able  to 
deductively  entail  the  input.  The  support  for 
these  new  pieces  of  knowledge  is  that  they 
allow  building  a  logical  connection  (the 
justification  tree)  between  a  knowledge  base 
that  represents  a  part  of  the  real  world,  and  a 
piece  of  knowledge  (the  input)  that  is  known 
to  be  true  in  the  real  world. 

Next, the system will generalize the plausible
justification tree, by employing different
types of generalizations (not only deductive
or empirical, but also based on analogy and,
possibly, on other types of inferences). By
this, it will generalize the hypothesized
knowledge, so that the resulting knowledge
base will entail not only the received input I,
but also a generalization of it, Ig.

The generalized plausible justification tree
shows how Ig (a generalization of the input I)
is entailed by the KB. However, this tree was
obtained by making both plausible inferences
and plausible generalizations. Consequently,
both the tree and the corresponding
knowledge pieces learned are less certain.
One may improve them by performing
experiments. This is the second stage of the
KB refinement process. During this stage, the
system will generate instances of Ig and will
ask the user if they are true or false. It will
further improve the hypothesized knowledge
pieces so that the updated KB deductively
entails the instances of Ig which are true and
rejects the ones which are false.

However, because the KB is incomplete and
possibly partially incorrect, some of the
learned knowledge pieces may be inconsistent
(i.e. may cover negative examples). In order
to remove such inconsistencies, additional
knowledge pieces (which represent new terms
in the representation language of the system)
are elicited from the expert, through several
consistency-driven knowledge elicitation
techniques. This represents the third stage of
the KB refinement process.


The  entire  knowledge  refinement  process  is 
characterized  by  a  cooperation  between  the 
learning  system  and  the  human  expert  in 
which  the  learner  performs  most  of  the  tasks 
and  the  expert  helps  it  in  solving  the  problems 
that  are  intrinsically  difficult  for  a  learner 
(e.g.,  the  credit/blame  assignment  problem, 
the  problem  of  new  terms)  and  relatively  easy 
for  the  human  expert. 

4.  Exemplary  application  domain 

We  will  use  the  domain  of  workstation 
allocation  and  configuration  in  order  to 
illustrate  this  KB  refinement  method.  The 
expert  system  to  be  built  has  to  reason  about 
which  machines  are  suitable  for  which  tasks 
and  to  allocate  an  appropriate  machine  for 
each  task. 

The  initial  (incomplete  and  partially 
incorrect)  knowledge  base  contains 
information  about  various  printers  and 
workstations  distributed  throughout  the 
workplace.  A  sample  of  this  knowledge  base 
is  presented  in  Figure  4.  Notice  that  it 
contains  different  types  of  knowledge: 
deductive  rules,  a  plausible  determination 
(Russell,  1989;  Tecuci,  1993),  facts,  and 
hierarchies.  Each  of  these  knowledge  pieces 
might  be  incomplete  and/or  partially 
incorrect. 

Let us suppose that the system is told that
macII02 is suitable for publishing:

suitable(macII02,  publishing) 

and  this  fact  is  representative  of  the  type  of 
answers  it  should  be  able  to  provide. 

5.  Multitype  inference  and 
generalization 

The  system  tries  to  analyze  (“understand”) 
the  input  in  terms  of  its  current  knowledge  by 
building  the  plausible  justification  tree  in 
Figure  5.  Such  a  tree  demonstrates  that  the 
input  is  a  plausible  consequence  of  the 
system's current knowledge. The method for
building  such  a  tree  is  presented  in  (Tecuci, 
1993).  It  employs  a  backward  chaining 
uniform-cost  search. 

The  tree  in  Figure  5  is  composed  of  four 
deductions,  an  inductive  prediction,  and  a 
determination-based  analogical  implication. 
suitable(X, publishing) :-                      ; X is suitable for publishing if it runs
    runs(X, publishing-sw), communicate(X, Y),  ; publishing software and communicates
    isa(Y, high-quality-printer).               ; with a high quality printer

communicate(X, Y) :-                            ; X and Y communicate if they are on the same network
    on(X, Z), on(Y, Z).

communicate(X, Y) :-                            ; X and Y communicate if they are on connected networks
    on(X, Z), on(Y, V), connect(Z, V).

isa(X, high-quality-printer) :-                 ; X is a high quality printer if
    isa(X, printer), speed(X, high), resolution(X, high).  ; it has high speed and resolution

runs(X, Y)  os(X, Z).   ; the type of software which a machine could run is largely determined
                        ; by its operating system (a plausible determination)

runs(X, Y) :- runs(X, Z), isa(Z, Y).

os(sun01, unix).  on(sun01, fddi).  speed(sun01, high).  processor(sun01, risc).
    ; sun01's operating system is unix, it is on the fddi network, has high speed and a risc processor

os(hp05, unix).  on(hp05, ethernet).  speed(hp05, high).  processor(hp05, risc).  runs(hp05, frame-maker).

os(macplus07, mac-os).         on(macplus07, appletalk).
os(macII02, mac-os).           on(macII02, appletalk).
os(maclc03, mac-os).           runs(maclc03, page-maker).
on(proprinter01, ethernet).    resolution(proprinter01, high).   processor(proprinter01, risc).
on(laserjet01, fddi).          resolution(laserjet01, high).     processor(laserjet01, risc).
on(microlaser03, ethernet).    resolution(microlaser03, high).   processor(microlaser03, risc).
resolution(xerox01, high).     speed(xerox01, high).             processor(xerox01, risc).
connect(appletalk, ethernet).  connect(appletalk, fddi).         connect(fddi, ethernet).

[Generalization hierarchy diagram, rooted at “something”. Legible nodes: printer (proprinter,
laserwriter, microlaser; instances proprinter01, microlaser03, laserjet01, xerox01);
workstation (macplus, maclc; instances macplus07, maclc03, macII02, sun01, hp05);
software (accounting, spreadsheet, publishing-sw; mac-write, page-maker, frame-maker,
microsoft-word); processor; network (ethernet, appletalk); op-system (mac-os).]

Figure 4: Sample of an incomplete and partially incorrect KB
for the domain of workstation allocation and configuration.
Figure 5: A plausible justification tree for “suitable(macII02, publishing)”.


The inductive prediction

processor(proprinter01, risc) → speed(proprinter01, high)

was made by empirically generalizing the
facts:

speed(sun01, high), os(sun01, unix), on(sun01, fddi),
processor(sun01, risc).

speed(hp05, high), os(hp05, unix), on(hp05, ethernet),
processor(hp05, risc), runs(hp05, frame-maker).

speed(xerox01, high), resolution(xerox01, high),
processor(xerox01, risc).

to the rule

speed(X, high) :- processor(X, risc)
and then applying this rule deductively.
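
As a rough illustration of this inductive prediction, the sketch below intersects the attribute-value pairs of the three described objects and then applies the resulting rule to a new object. It is an assumption-laden simplification: objects are flat dictionaries, `shared_features` and `predict` are hypothetical helper names, and the choice of which shared feature becomes the rule's conclusion is made by hand (here, the feature needed by the justification tree).

```python
# Object descriptions taken from the facts listed above.
examples = {
    "sun01": {"speed": "high", "os": "unix", "on": "fddi", "processor": "risc"},
    "hp05": {"speed": "high", "os": "unix", "on": "ethernet",
             "processor": "risc", "runs": "frame-maker"},
    "xerox01": {"speed": "high", "resolution": "high", "processor": "risc"},
}


def shared_features(objects):
    """Attribute-value pairs common to every described object."""
    common = None
    for attrs in objects.values():
        pairs = set(attrs.items())
        common = pairs if common is None else common & pairs
    return common or set()


def predict(attrs, condition, conclusion):
    """Apply the rule 'conclusion :- condition' deductively to one object."""
    key, value = condition
    return conclusion if attrs.get(key) == value else None


common = shared_features(examples)
proprinter01 = {"processor": "risc", "resolution": "high", "on": "ethernet"}
prediction = predict(proprinter01, ("processor", "risc"), ("speed", "high"))
print(common, prediction)
```

The intersection only says that the two features co-occur in the known examples; reading it as "risc processor implies high speed" is the plausible (and fallible) step.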

An  open  problem  is  how  to  collect  the  facts  to 
be  generalized,  and  what  kind  of 
generalization  to  look  for.  One  solution  which 
we  are  investigating  is  based  on  CLINT's 
approach  (De  Raedt,  1991)  of  using  a 
hierarchy  of  languages  for  the  rules  to  be 
learned.  Each  language  is  characterized  by  a 
certain  form  of  the  rules  to  be  learned,  which 
suggests  the  kind  of  facts  to  look  for. 

The  analogical  inference  was  made  by  using 
the  determination  rule 

runs(X, Y)  os(X, Z)
as  indicated  in  Figure  6. 

While there may be several justification trees
for a given input, the attempt is to find the
simplest and the most plausible one (Lee,
1993). This tree shows how a true fact I
derives from other true facts from the KB.
Based on Occam's razor (Blumer,
Ehrenfeucht, Haussler, and Warmuth, 1987),
and on the general hypothesis used in
abduction which states that the best
explanation of a true fact is most likely to be
true (Peirce, 1965), one could assume that all
the inferences from the simplest and most
plausible justification tree are correct.

With this assumption, the KB is improved by:

• learning a new rule by empirical inductive
generalization:

speed(X, high) :- processor(X, risc)
with the positive examples
X=sun01, X=hp05, X=xerox01, X=proprinter01

• discovering positive examples of the
determination rule (which is therefore
reinforced):

runs(X, Y)  os(X, Z).
with the positive examples
X=maclc03, Y=page-maker, Z=mac-os
X=macII02, Y=page-maker, Z=mac-os

• discovering positive examples for the
deductive rules used in building the
plausible justification tree as, for instance:

suitable(X, publishing) :-
runs(X, publishing-sw), communicate(X, Y),
isa(Y, high-quality-printer).
with the positive example
X=macII02, Y=proprinter01

Therefore, by merely verifying a statement,
the user allows the system to refine the KB
by making several justified hypotheses. As a
result of these improvements

$KB \models$ suitable(macII02, publishing)

During  KB  refinement,  the  rules  are 
constantly  updated  so  as  to  remain  consistent 
with  the  accumulated  examples.  This  is  a  type 
of  incremental  learning  with  full  memory  of 
past  examples. 

As mentioned before, the input fact
“suitable(macII02, publishing)” is representative
of the kind of answers the final system
should be able to generate. This means that
the final system should be able to give other
answers of the form “suitable(x, y)”. It is
therefore desirable to extend DC so as to
include other such true facts, but also to
improve DC so as no longer to include false
facts of the same form.

While the integration of the fact
“suitable(macII02, publishing)” into DC was
a costly process that involved multitype
inferences and the determination of the most
plausible justification tree, the integration in
(or exclusion from) DC of similar facts is a
much simpler process which basically
replicates most of the reasoning involved in
the “understanding” of the input
“suitable(macII02, publishing)”. This feature
is one of the main strengths of the proposed
KB refinement method.

The basic idea is the following one. One
performs a costly reasoning process to show
that $KB \mid\approx I$. Then one computes a
generalization of that reasoning so as to speed up
future problem solving which requires a
similar reasoning. Indeed, such a reasoning
process could be generated by simply
instantiating this generalization to the new
problem to be solved.

A simple illustration of this idea is
explanation-based learning (Mitchell, Keller,
Kedar-Cabelli, 1986; DeJong and Mooney,
1986). In this case, an explanation (proof tree)
of a concept example is deductively generalized.
Different instances of this deductive
generalization demonstrate that other descriptions
are examples of the same concept.

In our method, each inference from the
plausible justification tree is replaced with a
generalization which depends on the type of
inference, as shown in (Tecuci, 1993). Thus,
the system is performing not only deductive
generalizations, but also empirical inductive
generalizations, generalizations based on different
types of analogies, and possibly, even
generalizations based on abduction. To illustrate
this, let us consider the analogical implication
from Figure 5. The process of making
this inference is illustrated in Figure 6.

According to the plausible determination rule
“runs(X, Y)  os(X, Z)”, the software which
a machine can run is largely determined by its
operating system. It is known that the operating
system of “maclc03” is “mac-os”, and
that it runs “page-maker”. Because the
operating system of “macII02” is also “mac-os”,
one may infer by analogy that “macII02”
could also run “page-maker”.


[Diagram: os(maclc03, mac-os) determines runs(maclc03, page-maker);
os(macII02, mac-os) is similar to os(maclc03, mac-os), so, by analogy,
it determines the similar fact runs(macII02, page-maker).]

Figure 6: Inferring
“runs(macII02, page-maker)” by analogy.

Let  us  notice  now  that  the  same  kind  of 
reasoning  is  valid  for  any  type  of  operating 
system,  and  for  any  type  of  software,  as 
illustrated  in  Figure  7. 

[Diagram: os(X1, Z1) determines runs(X1, Y1); os(X2, Z1) is similar to
os(X1, Z1), so it determines the similar fact runs(X2, Y1).

GENERALIZATION BASED ON ANALOGY:
from os(X1, Z1), os(X2, Z1), and runs(X1, Y1), infer runs(X2, Y1).]

Figure 7: Generalization of the
reasoning illustrated in Figure 6.

Now,  if  one  knows,  for  instance,  that 
“os(hp05, unix)”, “runs(hp05, frame-maker)”,
and  “os(sun01,  unix)”,  then  one  may 
immediately  infer  “runs(sun01,  frame- 
maker)”,  by  simply  instantiating  the  general 
inference  from  the  bottom  of  Figure  7. 
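
Instantiating the general analogical inference amounts to a lookup over the known facts. The sketch below is illustrative only and rests on assumptions not in the paper: facts are held in two dictionaries, and `runs_by_analogy` is a hypothetical helper name.

```python
def runs_by_analogy(os_of, runs, target):
    """Software that `target` plausibly runs: whatever is run by another
    machine with the same operating system (determination-based analogy)."""
    target_os = os_of.get(target)
    inferred = set()
    if target_os is None:
        return inferred
    for machine, programs in runs.items():
        if machine != target and os_of.get(machine) == target_os:
            inferred |= programs
    # Report only what is not already known about the target.
    return inferred - runs.get(target, set())


os_of = {"hp05": "unix", "sun01": "unix",
         "maclc03": "mac-os", "macII02": "mac-os"}
runs = {"hp05": {"frame-maker"}, "maclc03": {"page-maker"}}

print(runs_by_analogy(os_of, runs, "sun01"))
print(runs_by_analogy(os_of, runs, "macII02"))
```

Both the original inference about macII02 and the new one about sun01 come out of the same generalized pattern; the conclusions remain plausible rather than certain.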

This might not appear to be a significant
saving but, in the case of a plausible
justification tree, one generalizes several such
individual inferences and, even more
importantly, their interconnection in a
plausible reasoning process. For instance, the
generalization of the tree in Figure 5 is
presented in Figure 8. The generalization
method is presented in (Tecuci, 1993).

[Tree diagram:
suitable(X, publishing)
    by DEDUCTIVE GENERALIZATION from runs(X, publishing-sw),
    communicate(X, Y), isa(Y, high-quality-printer)
runs(X, publishing-sw)
    by DEDUCTIVE GENERALIZATION from runs(X, U), isa(U, publishing-sw)
runs(X, U)
    by GENERALIZATION BASED ON ANALOGY from os(Z, T), os(X, T), runs(Z, U)
communicate(X, Y)
    by DEDUCTIVE GENERALIZATION from on(X, V), connect(V, W), on(Y, W)
isa(Y, high-quality-printer)
    by DEDUCTIVE GENERALIZATION from isa(Y, printer), resolution(Y, high),
    speed(Y, high)
speed(Y, high)
    by EMPIRICAL INDUCTIVE GENERALIZATION from processor(Y, risc)]

Figure 8: A generalized plausible justification tree.

The important thing about the general tree in
Figure 8 is that it covers many of the
plausible justification trees for facts of the
form “suitable(x, publishing)”. If, for
instance, sun01 is another computer for which
the leaves of the plausible justification tree in
Figure 8 are true, then the system will infer
that sun01 is also suitable for publishing, by
simply instantiating the tree in Figure 8 (see
Figure 9).

Let us mention again that while building the
plausible justification tree in Figure 5 was a
difficult problem which required employing
different types of reasoning and
determining the simplest and the most
plausible justification tree, the building of the
tree in Figure 9 was a very simple process of
instantiating the tree in Figure 8. However,
once this tree is built, one may draw
conclusions that are similar to the ones drawn
from the tree in Figure 5:

•  if  the  top  of  the  tree  in  Figure  9  is  known  to 
be  true,  then  one  may  assume  that  all  the 
intermediate  implications  are  also  valid. 
This  reinforces  each  rule  used  in  building 
the  tree  with  a  new  positive  example. 

• if the top of the tree is not true, then one has
to identify the wrong implication and to
correct the KB accordingly.

6. Experimentation, verification and
repair

Building  the  plausible  justification  tree  from 
Figure  5  and  its  generalization  from  Figure  8 
was  the  first  stage  of  the  KB  refinement 
process  described  in  Figure  3.  The  next  stage 
is  one  of  experimentation,  verification,  and 
repair. 


[Tree diagram:
suitable(sun01, publishing)
    by deduction from runs(sun01, publishing-sw),
    communicate(sun01, microlaser03), isa(microlaser03, high-quality-printer)
runs(sun01, publishing-sw)
    by deduction from runs(sun01, frame-maker), isa(frame-maker, publishing-sw)
runs(sun01, frame-maker)
    by analogy from os(hp05, unix), os(sun01, unix), runs(hp05, frame-maker)
communicate(sun01, microlaser03)
    by deduction from on(sun01, fddi), connect(fddi, ethernet),
    on(microlaser03, ethernet)
isa(microlaser03, high-quality-printer)
    by deduction from isa(microlaser03, printer), resolution(microlaser03, high),
    speed(microlaser03, high)
speed(microlaser03, high)
    by inductive prediction from processor(microlaser03, risc)]

Figure 9: An instance of the plausible justification tree in Figure 8,
justifying that sun01 is suitable for publishing.


The system will generate plausible justification
trees like the one in Figure 9. These
trees show how statements of the form
“suitable(x, publishing)” plausibly derive
from the KB. Each such statement is shown to
the user who is asked if it is true or false.
Then, the system (with the expert's help) will
update the KB such that it will deductively
entail the true statements and only those.

The experimentation phase is controlled by a
heuristic search in a plausible version space
(PVS) which significantly limits the number
of experiments needed to improve the KB. In
the case of our example, the plausible version
space is defined by the trees in Figure 5 and
Figure 8, and is represented in Figure 10.


plausible upper bound
suitable(X, publishing) :-
    os(Z, T), runs(Z, U), os(X, T),
    isa(U, publishing-sw), on(X, V),
    connect(V, W), on(Y, W),
    isa(Y, printer), processor(Y, risc),
    resolution(Y, high).

plausible lower bound
suitable(macII02, publishing) :-
    os(maclc03, mac-os), runs(maclc03, page-maker),
    os(macII02, mac-os),
    isa(page-maker, publishing-sw),
    on(macII02, appletalk), connect(appletalk, ethernet),
    on(proprinter01, ethernet), isa(proprinter01, printer),
    processor(proprinter01, risc),
    resolution(proprinter01, high).

Figure 10: The plausible version space (PVS)

The plausible upper bound is a rule the left
hand side of which is the top of the general
tree in Figure 8, and the right hand side of
which is the conjunction of the leaves of the
same tree. The plausible lower bound is a
similar rule corresponding to the tree in
Figure 5. This plausible version space
synthesizes some of the inferential
capabilities of the system with respect to the
facts of the form “suitable(x, publishing)”.
We call these bounds plausible because they
are only approximations of the real bounds
(Tecuci, 1992). The upper bound rule is
supposed to be more general than the exact
rule for inferring “suitable(x, publishing)”,
and the lower bound rule is supposed to be
less general than this rule. Let us notice that
this version space corresponds to the version
space in Figure 2. The plausible upper bound
corresponds to the plausible closure, and the
plausible lower bound corresponds to the
deductive closure. Of course, this space is
restricted to facts of the form “suitable(x,
publishing)”.

The  version  space  in  Figure  10  could  be 
represented  in  the  equivalent  form  in 
Figure  11.  Note  that  the  facts  of  the  form 
“isa(Q,  something)”  are  always  true. 


suitable(X, publishing) :-

plausible upper bound
    isa(T, something), isa(U, publishing-sw),
    isa(V, something), isa(W, something),
    isa(X, something), isa(Y, printer),
    isa(Z, something), os(Z, T), runs(Z, U), os(X, T),
    on(X, V), connect(V, W), on(Y, W),
    processor(Y, risc), resolution(Y, high).

plausible lower bound
    isa(T, mac-os), isa(U, publishing-sw),
    isa(V, appletalk), isa(W, ethernet),
    isa(X, macII02), isa(Y, printer),
    isa(Z, maclc03), os(Z, T), runs(Z, U), os(X, T),
    on(X, V), connect(V, W), on(Y, W),
    processor(Y, risc), resolution(Y, high).

with the positive example
    T=mac-os, U=page-maker, V=appletalk,
    W=ethernet, X=macII02, Y=proprinter01,
    Z=maclc03.

Figure 11: Equivalent form of the
plausible version space in Figure 10.

The version space in Figure 11 serves both
for  generating  facts  of  the  form  “suitable(x, 
publishing)”,  and  for  determining  the  end  of 
the  experimentation  phase. 

To  generate  such  a  fact,  the  system  looks  into 
the  KB  for  an  instance  of  the  upper  bound 
which  is  not  an  instance  of  the  lower  bound. 
Such  an  instance  is  the  following  one: 

suitable(sun01, publishing) :-
    os(hp05, unix),
    runs(hp05, frame-maker),
    os(sun01, unix),
    isa(frame-maker, publishing-sw),
    on(sun01, fddi),
    connect(fddi, ethernet),
    on(microlaser03, ethernet),
    isa(microlaser03, printer),
    processor(microlaser03, risc),
    resolution(microlaser03, high).

which  could  be  written  as: 
suitable(X, publishing) :-
    isa(T, unix), isa(U, publishing-sw), isa(V, fddi),
    isa(W, ethernet), isa(X, sun01),
    isa(Y, microlaser03), isa(Z, hp05), os(Z, T),
    runs(Z, U), os(X, T), on(X, V), connect(V, W),
    on(Y, W), processor(Y, risc), resolution(Y, high).
with the positive example
    T=unix, U=frame-maker, V=fddi, W=ethernet,
    X=sun01, Y=microlaser03, Z=hp05.
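
The search for an instance of the upper bound that is not an instance of the lower bound can be sketched as a small backtracking matcher over ground facts. Everything here is a simplified assumption made for illustration: facts are tuples, variables are strings starting with an upper-case letter, `isa` facts are stored directly instead of being derived from the hierarchy, only a subset of the Figure 4 facts is included, and "not covered by the lower bound" is reduced to "X is not the already known positive example". The names `match` and `experiments` are hypothetical.

```python
FACTS = {
    ("os", "maclc03", "mac-os"), ("runs", "maclc03", "page-maker"),
    ("os", "macII02", "mac-os"), ("on", "macII02", "appletalk"),
    ("os", "hp05", "unix"), ("runs", "hp05", "frame-maker"),
    ("os", "sun01", "unix"), ("on", "sun01", "fddi"),
    ("connect", "appletalk", "ethernet"), ("connect", "fddi", "ethernet"),
    ("on", "proprinter01", "ethernet"), ("on", "microlaser03", "ethernet"),
    ("processor", "proprinter01", "risc"), ("processor", "microlaser03", "risc"),
    ("resolution", "proprinter01", "high"), ("resolution", "microlaser03", "high"),
    ("isa", "page-maker", "publishing-sw"), ("isa", "frame-maker", "publishing-sw"),
    ("isa", "proprinter01", "printer"), ("isa", "microlaser03", "printer"),
}

# Body of the plausible upper bound (Figure 10).
UPPER = [("os", "Z", "T"), ("runs", "Z", "U"), ("os", "X", "T"),
         ("isa", "U", "publishing-sw"), ("on", "X", "V"),
         ("connect", "V", "W"), ("on", "Y", "W"), ("isa", "Y", "printer"),
         ("processor", "Y", "risc"), ("resolution", "Y", "high")]


def is_var(term):
    return term[0].isupper()


def match(body, facts, binding=None):
    """Yield every variable binding that makes all body atoms true."""
    binding = binding or {}
    if not body:
        yield dict(binding)
        return
    atom, rest = body[0], body[1:]
    for fact in facts:
        if len(fact) != len(atom) or fact[0] != atom[0]:
            continue
        new, ok = dict(binding), True
        for term, value in zip(atom[1:], fact[1:]):
            if is_var(term):
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, facts, new)


def experiments(upper, known_positive, facts):
    """Machines x for which suitable(x, publishing) is worth asking about."""
    return sorted({b["X"] for b in match(upper, facts)} - known_positive)


candidates = experiments(UPPER, {"macII02"}, FACTS)
print(candidates)
```

With this reduced fact base the only new candidate is sun01, which corresponds to the experiment shown to the user in the text.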

The  corresponding  instance  of  the  general 
tree  in  Figure  8  shows  how  “suitable(sun01, 
publishing)”  is  plausibly  entailed  by  the  KB 
(see  Figure  9).  The  user  is  asked  if 
“suitable(sun01,  publishing)”  is  true  or  false, 
and  the  KB  is  updated  accordingly. 
Assuming  the  user  accepted  “suitable(sun01, 
publishing)”  as  a  true  fact,  the  KB  and  the 
plausible  version  space  are  updated  as 
follows: 

•  the  KB  is  improved  so  as  to  deductively 
entail  “suitable(sun01,  publishing)”; 

• the plausible lower bound of the PVS is
conjunctively generalized to “cover” the
leaves of the tree in Figure 9.

It  has  already  been  shown  how  the  KB  is 
improved  (see  section  5).  The  plausible 
lower  bound  of  the  PVS  is  generalized  as 
shown  in  Figure  12. 
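
One way to picture this generalization of the lower bound is as a climb in the generalization hierarchy: each variable's class is raised to the most specific node covering both the old and the new positive example. The sketch below assumes a flattened parent table (the intermediate classes of Figure 4 are omitted) and uses hypothetical helper names; it is not the paper's algorithm, only a minimal illustration of the idea.

```python
# Assumed, simplified fragment of the hierarchy (child -> parent).
PARENT = {
    "mac-os": "op-system", "unix": "op-system",
    "appletalk": "network", "fddi": "network", "ethernet": "network",
    "macII02": "workstation", "sun01": "workstation",
    "maclc03": "workstation", "hp05": "workstation",
    "proprinter01": "printer", "microlaser03": "printer",
    "page-maker": "publishing-sw", "frame-maker": "publishing-sw",
    "publishing-sw": "software",
    "op-system": "something", "network": "something",
    "workstation": "something", "printer": "something",
    "software": "something",
}


def ancestors(node):
    """The node itself followed by its chain of parents up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_general(a, b):
    """Most specific node of the hierarchy that covers both a and b."""
    above_a = ancestors(a)
    for node in ancestors(b):
        if node in above_a:
            return node
    return "something"


def generalize(lower, example):
    """Minimally generalize each variable's class to cover the new example."""
    return {var: least_general(lower[var], example[var]) for var in lower}


lower = {"T": "mac-os", "U": "page-maker", "V": "appletalk", "W": "ethernet",
         "X": "macII02", "Y": "proprinter01", "Z": "maclc03"}
new_example = {"T": "unix", "U": "frame-maker", "V": "fddi", "W": "ethernet",
               "X": "sun01", "Y": "microlaser03", "Z": "hp05"}
updated = generalize(lower, new_example)
print(updated)
```

Under these assumptions the variable classes of the result agree with the updated lower bound of Figure 12; W stays at ethernet because both examples share that value.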

Let  us  also  consider  the  case  of  a  generated 
fact  which  is  rejected  by  the  user: 

“suitable(macplus07,  publishing)”. 

The  corresponding  plausible  justification  tree 
is  shown  in  Figure  13.  This  tree  was  obtained 
by  instantiating  the  general  tree  in  Figure  8 
with facts from the KB. It shows how a false
fact  is  plausibly  entailed  by  the  KB. 


suitable(X, publishing) :-

plausible upper bound
    isa(T, something), isa(U, publishing-sw),
    isa(V, something), isa(W, something),
    isa(X, something), isa(Y, printer),
    isa(Z, something), os(Z, T), runs(Z, U), os(X, T),
    on(X, V), connect(V, W), on(Y, W),
    processor(Y, risc), resolution(Y, high).

plausible lower bound
    isa(T, op-system), isa(U, publishing-sw),
    isa(V, network), isa(W, ethernet),
    isa(X, workstation), isa(Y, printer),
    isa(Z, workstation), os(Z, T), runs(Z, U), os(X, T),
    on(X, V), connect(V, W), on(Y, W),
    processor(Y, risc), resolution(Y, high).

with the positive examples
    T=mac-os, U=page-maker, V=appletalk, W=ethernet,
    X=macII02, Y=proprinter01, Z=maclc03.

    T=unix, U=frame-maker, V=fddi, W=ethernet,
    X=sun01, Y=microlaser03, Z=hp05.

Figure 12: Updated PVS.

In  such  a  case  one  has  to  detect  and  correct 
the  wrong  inference(s),  as  well  as  to  update 
the  KB,  the  general  justification  tree  in  Figure 
8,  and  the  plausible  version  space  such  that: 

•  the  tree  in  Figure  13  is  no  longer  a  plausible 
justification  tree; 

•  the  KB  does  not  deductively  entail 
“suitable(macplus07,  publishing)”; 

•  the  updated  general  justification  tree  no 
longer  covers  the  tree  in  Figure  13; 

•  the  plausible  upper  bound  of  the  PVS  is 
specialized so that it no longer covers the
leaves  of  the  tree  in  Figure  13. 

Figure 13: Another instance of the plausible justification tree in Figure 8,
which shows how a false fact is plausibly entailed by the KB.


Detecting the wrong implication from the
plausible justification tree in Figure 13 is an
intrinsically difficult problem for an autonomous
learning system. One possible solution,
which is presented in (Tecuci, 1993), is
to blame the implication which is the least
plausible, and the correction of which
requires the smallest change in the KB. For a
human expert, however, it should not be too
difficult to identify the wrong implication and
even to find the explanation of the failure
(Tecuci, 1992). In the case of the tree in
Figure 13, the wrong implication could be
identified by the user as being the deduction
from the top of the tree:

suitable(macplus07, publishing) :-
runs(macplus07, publishing-sw),
communicate(macplus07, laserjet01),
isa(laserjet01, high-quality-printer).

Although macplus07 runs publishing software
and  communicates  with  a  high  quality  printer, 
it  is  not  suitable  for  publishing  because  it 
does  not  have  a  large  display. 

Consequently,  the  rule  which  generated  the 
above  implication  is  specialized  as  follows 
(requiring  X  to  have  a  large  display): 

suitable(X, publishing) :-
runs(X, publishing-sw), display(X, large),
communicate(X, Y), isa(Y, high-quality-printer).
with the positive examples
X=macII02, Y=proprinter01.
X=sun01, Y=microlaser03.
with the negative example
X=macplus07, Y=laserjet01.

One should notice that the predicate
“display(X, Y)” could be defined by the user,
or could be suggested by the system as one
which distinguishes the known positive examples
of the rule from the discovered negative
example.

As a result of updating the above rule, the
general plausible justification tree in Figure 8
is updated as shown in Figure 14, and the version
space is updated as shown in Figure 15.

It  might  not  always  be  easy  to  identify  the 
problem  with  a  wrong  inference,  and  to 
specialize  the  corresponding  rule  so  as  no 
longer  to  cover  the  negative  example.  In  such 
a  case,  the  wrong  inference  is  kept  as  a 
negative  exception  of  the  rule  which 
generated  it,  as  shown  in  Figure  16. 


suitable(X, publishing)

plausible upper bound
isa(T, something), isa(U, publishing-sw),
isa(V, something), isa(W, something),
isa(X, something), isa(Y, printer), isa(Z, something),
os(Z, T), runs(Z, U), os(X, T), display(X, large),
on(X, V), connect(V, W), on(Y, W),
processor(Y, risc), resolution(Y, high).

plausible lower bound
isa(T, op-system), isa(U, publishing-sw),
isa(V, network), isa(W, ethernet),
isa(X, workstation), isa(Y, printer),
isa(Z, workstation), os(Z, T), runs(Z, U), os(X, T),
display(X, large), on(X, V), connect(V, W),
on(Y, W), processor(Y, risc), resolution(Y, high).

with the positive examples
T=mac-os, U=page-maker, V=appletalk, W=ethernet,
X=macII02, Y=proprinter01, Z=macle03.

T=unix, U=frame-maker, V=fddi, W=ethernet,
X=sun01, Y=microlaser03, Z=hp05.

with the negative example
T=mac-os, U=page-maker, V=appletalk, W=fddi,
X=macplus07, Y=laserjet01, Z=macle03.

Figure 15: Updated PVS.


Figure 14: Updated general justification tree.
(Tree nodes: suitable(X, publishing); runs(X, publishing-sw);
display(X, large); communicate(X, Y); isa(Y, high-quality-printer).
Edge labels: deductive generalization, generalization based on
analogy, empirical inductive generalization.)

suitable(X, publishing) :-
runs(X, publishing-sw), communicate(X, Y),
isa(Y, high-quality-printer).
with the positive examples
X=macII02, Y=proprinter01.
X=sun01, Y=microlaser03.
with the negative exception
X=macplus07, Y=laserjet01.

Figure  16:  A  rule  with  a  negative  exception. 

During experimentation, the lower bound of
the plausible version space is generalized so
as to cover the generated facts accepted by
the user (the positive examples), and the upper
and lower bounds are specialized so as to
no longer cover the generated facts rejected
by the user (the negative examples). This process
will end in one of the following
situations:

• the bounds of the plausible version space
become identical;

•  the  bounds  are  not  identical,  but  the  KB  no 
longer  contains  any  instance  of  the  upper 
bound  of  the  version  space  that  is  not  an 
instance  of  the  lower  bound.  Therefore,  no 
new  fact  of  the  form  “suitable(x, 
publishing)”  can  be  generated. 

Notice  that  the  plausible  version  space  is  only 
used  for  controlling  the  experimentation 
phase.  It  is  not  kept  in  the  KB  as  a  new  rule 
for  inferring  “suitable(x,  publishing)”  because 
it  would  be  a  redundant  rule. 

7.  Goal-driven  knowledge  elicitation 

Because the KB is incomplete and partially
incorrect, some of the learned knowledge
pieces may be inconsistent (i.e. may cover
negative examples), as illustrated in
Figure 16. In order to remove such
inconsistencies, additional knowledge pieces
(which may represent new terms in the
representation language of the system) are
elicited from the expert, through several
consistency-driven knowledge elicitation methods,
as described in (Tecuci and Hieb, 1993).
These methods are applied in the third phase
of KB refinement, as shown in Figure 3.

Let  us  consider  the  case  of  the  inconsistent 
rule  in  Figure  16. 

One consistency-driven knowledge elicitation
method is to look for a new predicate which
could characterize all the positive instances of
X (although it might not be associated with
each of these instances), without characterizing
any negative exception of X. A
potentially discriminating predicate like
“display” is one which characterizes a
positive instance of X (either macII02 or
sun01), and does not characterize the negative
exception of X (macplus07). If such a
predicate is found, then the user is asked if it
characterizes all the other positive instances
of X. The same technique could, of course, be
applied to the instances of Y.
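
The search for a potentially discriminating predicate can be sketched as follows. This is only an illustration under stated assumptions: the KB is reduced to a hypothetical set of (predicate, instance, value) triples, the facts about display sizes are invented for the example, and the function names are not part of the system described here.

```python
def discriminating_properties(facts, positives, negatives):
    """Map each candidate property to the positive instances known to have it.

    facts is a set of (predicate, instance, value) triples, a simplified,
    hypothetical stand-in for the KB. A property (predicate, value) is a
    candidate if it holds for at least one positive instance and for no
    negative exception.
    """
    def properties_of(instance):
        return {(pred, value) for (pred, inst, value) in facts if inst == instance}

    # Properties of any negative exception can never discriminate.
    excluded = set()
    for negative in negatives:
        excluded |= properties_of(negative)

    candidates = {}
    for positive in positives:
        for prop in properties_of(positive) - excluded:
            candidates.setdefault(prop, set()).add(positive)
    return candidates


def instances_to_confirm(candidates, positives):
    """For each candidate, list the positive instances the user must be asked about."""
    return {prop: sorted(set(positives) - known) for prop, known in candidates.items()}


# Hypothetical facts: only sun01 is known to have a large display.
facts = {
    ("runs", "macII02", "publishing-sw"),
    ("runs", "sun01", "publishing-sw"),
    ("runs", "macplus07", "publishing-sw"),
    ("display", "sun01", "large"),
    ("display", "macplus07", "small"),
}
positives = ["macII02", "sun01"]
negatives = ["macplus07"]

candidates = discriminating_properties(facts, positives, negatives)
questions = instances_to_confirm(candidates, positives)
```

With these assumed facts, the only candidate is the property (display, large), known for sun01, and the user would be asked whether it also holds for macII02.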

It  may  happen,  however,  that  the  system 
cannot  find  a  property  to  transfer  from  one 
positive  example  of  X  to  the  others.  In  such  a 
case, it will try to elicit a new property by
using  a  technique  similar  to  the  triad  method 
employed  in  the  elicitation  of  the  repertory 
grids  (Boose  and  Bradshaw,  1988;  Shaw  and 
Gaines,  1987). 

Another  method  for  removing  the  negative 
exception  is  to  look  for  a  relationship  between 
X  and  Y  which  could  characterize  all  the 
positive  instances  of  X  and  Y,  without 
characterizing  the  negative  exception. 

Yet  another  method  is  to  define  a  new 
concept  that  covers  all  the  positive  instances 
of  X  (or  all  the  positive  instances  of  Y), 
without  covering  the  negative  exception  of  X 
(Y).  A  method  similar  to  this  one  is  reported 
by  (Wrobel,  1989). 

8.  Summary  and  conclusions 

Figure  2  suggests  two  basic  approaches  to 
the  development  of  the  competence  of  a 
deductive  knowledge-based  system.  One 
approach  is  to  extend  the  deductive  closure 
of  the  KB  by  acquiring  new  knowledge.  The 
other  approach  is  to  replace  the  deductive 
inference  engine  with  a  plausible  inference 
engine,  and  thus  to  enable  the  system  to 
solve  additional  problems  from  the  plausible 
closure.  The  first  approach  has  the  advantage 
that  the  system  employs  "sound"  reasoning, 
but  it  has  the  disadvantage  of  requiring  a 
difficult  knowledge  acquisition  process.  The 
second  approach  has  the  advantage  of 
avoiding  knowledge  acquisition,  but  the 
disadvantage  that  the  system  needs  to  rely  on 
plausible  reasoning. 

The knowledge refinement method presented
in this paper is an attempt to combine these
two approaches in such a way as to take
advantage of their complementarity. This
method is summarized in Figure 17. It
resulted from the merging and extension of
two related research directions:

•  the  knowledge  acquisition  methodology  of 
Disciple  (Tecuci  and  Kodratoff,  1990)  and 
NeoDisciple  (Tecuci,  1992); 

•  the  MTL-JT  framework  for  multistrategy 
learning  based  on  plausible  justification 
trees  (Tecuci,  1993). 

On  the  one  hand,  it  extends  NeoDisciple  with 
respect  to  the  knowledge  representation  used 
and  the  types  of  inferences  and 
generalizations  employed  and,  on  the  other 
hand,  it  adapts  and  integrates  the  MTL-JT 
framework  into  an  interactive  knowledge 
acquisition  scenario. 

The  method  is  based  on  the  following 
general  idea.  The  system  performs  a  complex 
reasoning  process  to  solve  some  problem  P. 
Then  it  determines  a  justified  generalization 
of  the  reasoning  process  so  as  to  speed  up  the 
process of solving similar problems P_i. When
the  system  encounters  such  a  similar 
problem,  it  will  be  able  to  find  a  solution  just 
by  instantiating  the  above  generalization. 

In the context of the presented method, the
problem to solve is to extend the KB so as to
entail a new fact I. The complex reasoning
process involved consists of building a plausible
justification tree. This reasoning
process is generalized by employing various
types of generalization procedures. Then,
during the experimentation phase, the system
instantiates this generalization and, using it,
improves the KB so as to entail similar facts
which are true (or to no longer entail similar
facts which are false).

One  important  aspect  of  the  presented 
method  is  the  notion  of  plausible  justification 
trees  (Tecuci  and  Michalski,  1991;  Tecuci, 
1993).  Other  systems  have  employed  implicit 
justification  trees  (DeRaedt  and  Bruynooghe, 
1993),  or  even  explicit  justification  trees 
(Tecuci,  1988;  Mahadevan,  1989;  Widmer, 
1989),  that  integrated  only  a  small  number  of 
inferences.  In  our  method,  the  plausible 
justification  tree  is  defined  as  a  general 
framework  for  integrating  a  whole  range  of 
inference  types.  Therefore,  theoretically, 
there is no limit with respect to the type or
number of inferences employed in a plausible
justification tree.

Figure 17: Modification of DC and PC
during KB refinement.

Another  important  feature  of  the  KB 
refinement  method  is  the  employment  of 
different types of generalizations. While the
current machine learning research
distinguishes only between deductive
generalizations and inductive generalizations
(Michalski, 1993), this method and the
MTL-JT framework (Tecuci, 1993) suggest that
one  may  consider  other  types  of 
generalizations,  each  associated  with  a 
certain  type  of  inference  (as,  for  instance, 
generalization  based  on  analogy). 

There  are  also  several  ways  in  which  the 
method  could  be  improved.  For  instance,  the 
set  of  inferences  involved  in  the  present 
version  of  the  method  is  quite  limited 
(deduction,  determination-based  analogy, 
inductive  prediction,  and  abduction).  New 
types  of  inferences  should  be  included,  as 
well  as  more  complex  versions  of  the  current 
ones. 

Also,  new  types  of  justified  generalizations 
(each  corresponding  to  a  certain  inference 
type)  should  be  defined. 

Finally,  the  goal-driven  knowledge  elicitation 
methods  briefly  mentioned  in  section  7 
should  be  extended  so  as  not  only  to  add  new 
concepts  and  relationships  into  the  KB  but 
also to delete those that become unnecessary.

Abstract

A novel learning task, that of learning
to survive in an unknown and
hostile environment, is defined and
explored. The environment is incorporated
in an adventure game of the
"hack" family and shares several important
features with the real world:
the actions are non-deterministic, the
agents possess only partial knowledge
about their performance and about
the world, and they have only a limited
influence on their environment.
Central to the adventure game is
the autonomous agent, which consists
of a planning and a learning
component. This agent (and its opponents)
can execute actions, which
will change the environment; some
actions will have desirable effects
with respect to the agent’s goals, others
will not. Initially, the effects of
the actions and the behavior of the
opponents are unknown to the agent.
In order to survive in the hostile environment,
the agent has to learn
the effects of the actions and the behaviour
of its opponents, to evaluate
the situation it is in, and to execute
the appropriate actions. To this
aim, we have implemented an agent
incorporating multiple strategies for
learning: learning by experimentation
(knowledge discovery), empirical
induction on examples (using inductive
logic programming), learning
control knowledge (by modifying
evaluation functions), and learning
from experience (using a knowledge
base manager).

1  Introduction 

Central to intelligence (whether natural or
artificial) is the ability to adapt to and to
perform well in an unknown environment.
Whereas the real world is still a very complex
environment for artificial intelligence, some
fundamental properties of the real world can
easily be modelled in artificial worlds such
as adventure games. Although such artificial
worlds are usually considered as toy-domains,
they often share important properties with the
real world, which means that understanding
intelligence in artificial worlds can enhance our
general understanding of intelligence. At the
same time, artificial worlds have the advantage
that their complexity can easily be controlled,
making feasible a stepwise introduction of real
world characteristics.

In this paper, we explore an artificial world of
an adventure game of the "hack" or "rogue"
family [Raymond and Threepoint, ; Stephenson, ].
In this type of game, the player controls
an agent living in a hostile environment randomly
generated on the board. The environment
is hostile in the sense that other agents
live in the same environment and may attack
the player’s agent. To survive, the player has
to collect items (such as food, weapons, etc.),
to eat and drink at regular times and to kill his
opponents. This kind of game has the following
real world characteristics: agents possess
only partial knowledge of their current situation
and their performance; to survive they
have to learn the effects of their actions and
the behavior of their opponents; and they have
to use that knowledge in order to select appropriate
actions to execute. The main contribution
of this work is the design and implementation
of an intelligent agent, which learns to
survive and to improve its performance in a
non-trivial artificial world. Furthermore, because
the agent’s most distinct characteristic
is its ability to learn using multiple strategies
(and it seems very hard, if not impossible,
to survive using a single state-of-the-art
learning strategy), our work provides evidence
that multistrategy learning is central to intelligence.
Our agent learns different types of
knowledge: rules to predict the effects of actions
and the behavior of opponents (using
empirical learning from examples) and an evaluation
function to assess the degree of performance
(learning control knowledge); it also
maintains only the interesting rules in
its knowledge base (learning from experience
using a knowledge base manager). It uses a
kind of minimax algorithm as planner to select
the next action to execute.

This paper is organized as follows: in Section
2, we discuss the artificial world incorporated
in the adventure game; in Section 3, we present
the overall architecture of the learning system;
in Section 4, we discuss rule generation and
knowledge base management; in Section 5, we
show how the evaluation function is learned; in
Section 6, we present the planner; in Section
7, we report on the current state of the system
and some experiments; finally, in Section 8, we
conclude and touch briefly on related work.


2  The  environment 

The game is played using the graphical interface
shown in Figure 1. The available knowledge
about the current situation of the learning
agent (or the human player, if the game
is played by humans) is shown on a board of
9 positions (the board shown in Figure 1 contains
49 positions, but only the 9 central positions
are observable by the player). The position
in the middle of the board is always occupied
by the learning agent. The position of
the learning agent is identified by the number
0, and the agent is represented by its own symbol.
Neighbouring positions have numbers ranging from
1 to 8. The neighbouring positions may contain
opponents (Humans, Dragons and Bats)
and objects (Gold, Food, Potions,
Armors, Handweapons, Wands
and spells of different types), each shown with its
own symbol; see the Appendix
for more details. For example, in Figure 1,
the agent is next to a Bat (B) on position
5, next to food on position 2, next to a wall
(*), and next to a position containing several
objects (#). Objects can be cursed; cursed
objects behave differently than uncursed ones.
Curses can be added or removed by quaffing
potions, casting spells or zapping wands. Objects
are possessed by agents (the learning
agent has an inventory of the objects it owns)
or positioned on the board. To survive in the
game, the agent has to gather objects and use
them as a protection against its opponents.
The aim of the game is twofold: to become
rich (i.e. acquire as much gold as possible) and
to maximize the energy level. The energy level
is the only means of direct performance evaluation.
It decreases when the agent is hit by
opponents, when it hits one of its opponents or
when it eats cursed food; it increases when the
agent consumes uncursed food. If the energy
level falls below zero, the agent dies.

The game operates in a cyclic process. In the
first step of each cycle, the learning agent can
execute an action and in the second step of
each cycle, all its opponents may execute an
action. A list of possible actions is specified
in Appendix 1. The cyclic process continues
until the agent dies or the user stops the game.

Figure 1: Graphical interface of the game.

Important for our scientific purposes is that
the actions are non-deterministic, that the
only direct evaluation criteria our agent has
access to are the energy level and the number of
gold pieces, and that when the agent moves towards
previously unvisited positions, they are
filled out randomly. Non-deterministic in our
context means that executing the same action
twice in the same situation may yield different
effects (determined by a random generator).
For instance, when hitting an agent, three effects
can occur: the agent misses, the agent
hits the other agent (and the other agent’s energy
level decreases), or the agent hits the other
agent (and the other agent’s energy level falls
below zero, resulting in the death of the other
agent).
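
The three possible effects of hitting an agent can be illustrated with a small sketch. The damage value, the miss probability and the function name are assumptions chosen for the illustration; they are not taken from the game described here.

```python
import random


def hit(rng, target_energy, damage=30, miss_probability=0.4):
    """Return (effect, new_target_energy) for one 'hit' action.

    The damage and miss probability are illustrative values only.
    """
    if rng.random() < miss_probability:
        return "miss", target_energy          # effect 1: the agent misses
    remaining = target_energy - damage
    if remaining < 0:
        return "kill", remaining              # effect 3: energy falls below zero
    return "hit", remaining                   # effect 2: energy decreases


# Executing the same action repeatedly in the same situation yields different effects.
rng = random.Random(0)
effects_strong_target = {hit(rng, 100)[0] for _ in range(200)}
effects_weak_target = {hit(rng, 20)[0] for _ in range(200)}
```

A learner observing such an action sees several distinct add- and remove-lists for one and the same situation and action, which is why rules need correctness information rather than a single deterministic effect.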

The initial knowledge of the game our agent
starts from is similar to that of a novice human
player. More specifically, the ONLY initial
knowledge our agent possesses is the following:

• complete knowledge of the effects of the
move action;

• for all other actions, a list of determinations
[Russell, 1989] specifying for each
action, the possibly relevant literals; determinations
capture the intuitive knowledge
a novice human player has about the
actions that can be executed; for instance,
for the actions drop and pick up only objects
possessed or on the current position
of the agent and general characteristics
(such as curses, energy and gold) are
relevant whereas the opponents and the
neighbouring positions are irrelevant; for
the actions wear and take off only features
related to armors and general characteristics
are relevant; the determinations are
used to remove irrelevant literals when
constructing examples, i.e. only literals
matching the determinations are included
in the example descriptions;

• a list of general features influencing the
aim of the game (control knowledge); this
knowledge is used by the evaluation function
learner (see Section 6) and is basically
a list of features such as the number
of armors, the number of weapons, the
weapons being wielded, the number of opponents
on neighbouring positions, etc.
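
How determinations remove irrelevant literals can be sketched as a simple filter. This is a minimal illustration, assuming that a determination is just a set of relevant predicate names per action; the table contents and the names DETERMINATIONS and relevant_literals are hypothetical, and the actual determinations may be finer-grained (for example, restricted to objects the agent possesses).

```python
# Hypothetical determinations: for each action, the predicates that may be relevant.
DETERMINATIONS = {
    "wield": {"possessing", "wielding", "kind", "kind_of_weapon",
              "my_energy", "my_gold", "alive", "uncursed"},
    "wear": {"possessing", "wearing", "kind", "kind_of_armor",
             "my_energy", "my_gold", "alive", "uncursed"},
}


def relevant_literals(action, situation):
    """Keep only the literals whose predicate matches the action's determination.

    A literal is a tuple (predicate, arg1, arg2, ...).
    """
    relevant = DETERMINATIONS[action]
    return [literal for literal in situation if literal[0] in relevant]


situation = [
    ("agent_on", 1, "d"),          # opponent on a neighbouring position
    ("obstacle_on", 3),            # neighbouring position
    ("possessing", "w"),
    ("kind", "w", "weapon"),
    ("my_energy", 500),
    ("alive",),
]

filtered = relevant_literals("wield", situation)
```

In this sketch the literals about neighbouring positions are dropped for the wield action, while the literals about the weapon and the general characteristics are kept.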

We believe such knowledge corresponds to
prior assumptions and expectations each rational
agent has (and should have) when operating
in this domain. Learning without such
knowledge is possible as well, but is much
slower and slightly harder.

3  The  Autonomous  Agent’s 
Architecture 

Figure  2  shows  the  architecture  of  the  system, 
its  main  components  and  the  information  flow 
between  the  different  components. 

Figure 2: The Agent’s Architecture


In the two steps of the game cycle, the current
situation of the game is transformed into
the next one. In the first step, this is done
by executing the action selected by the autonomous
agent, whereas the transformation
of the second step is determined by the game.
Each transformation is encoded in a symbolic
description, which is passed to the rule generator
as an example. The rule generator possibly
generalizes (using inductive logic programming)
the rules in the knowledge base with
the given example, and passes the resulting
generalizations (or the example when no good
generalization is found) to the knowledge base
manager, which may reorganize the knowledge
base to accommodate new rules. The task of the
knowledge base manager is to decide which
rules to keep and which ones to forget. The
planner uses the knowledge base and the evaluation
function to select the next action to execute.
The planner performs a variant of minimax
search. The evaluation function is learned
by an independent control learning module.

The rules, their generator and the knowledge
base manager are described in Section 4; the
evaluation function learner in Section 5; and
the planner in Section 7. Because the current
system is very complex (it is implemented in
Prolog by BIM and contains more than 10000
lines of Prolog code, comments not included),
we shall make some slight simplifications in the
presentation of the system and focus on concepts
rather than on implementation details.


4  Learning  and  managing  rules 

4.1  Representing  rules  and  examples 

We first introduce some concepts, which are
illustrated in Example 1. A situation description
(of a state in the game) is a conjunction
of ground atoms (as perceived by the learning
agent). Negated atoms are not explicitly listed
in situation descriptions as the closed world assumption
is being used [Reiter, 1978]. The notation
for the closure of a situation description
S is cwa(S). Also, rather than applying the
closed world assumption as it is, we explicitly
introduced negated literals for selected predicates.
The list of these predicates and their
closure is given in Appendix 2.

For each step in all cycles of the game, an
example is constructed. An example is composed
of four parts: the description S of the
situation before the cycle, an atom Act denoting
the action executed, and an add- and a
remove-list (Al, Rl). The remove list contains
the set of literals that have to be removed
from the situation description after executing
the action; the add list contains the list of literals
to be added to the situation description.
This STRIPS-like representation was chosen
because of its simplicity and power [Fikes and
Nilsson, 1971].

The example generator constructs, for each
pair of situation descriptions $S_1$ and $S_2$
and connecting action $A$, an example
$(S, Act, Al, Rl)$ where
$S = S_1 \cup \{\neg l \mid l \in S_2 - S_1\}$
(the second set contains the negation
of all literals present after applying the
action but not present before); $Act = A$;
$Al = cwa(S_2) - cwa(S_1)$; and
$Rl = cwa(S_1) - cwa(S_2)$.¹ Negated literals are only used to
model differences between the two situations.

¹ Furthermore, the energy and gold levels of the situation
descriptions $S_1$ and $S_2$ are compared. For non-identical
levels, the comparison predicate less_than is
applied and a literal for this predicate is included in
the add-list. Also, literals considered irrelevant for the
action are filtered from the situation description (cf.
Section 2).
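
The construction of an example from two situation descriptions can be sketched as follows. This is a simplified illustration: literals are tuples, negation is encoded by a leading "not" tag, and the closure is taken only over the atoms that occur in either situation. That restricted closure is an assumption of the sketch; the system uses the selected predicates of Appendix 2.

```python
def negate(atom):
    """Encode the negation of a ground atom (predicate, args...) as ('not', predicate, args...)."""
    return ("not",) + atom


def cwa(situation, universe):
    """Closure of a situation: every atom of the universe that is absent is added negated."""
    return set(situation) | {negate(atom) for atom in universe - set(situation)}


def build_example(s1, s2, action):
    """Return (S, Act, Al, Rl) for the transformation of s1 into s2 by action."""
    s1, s2 = frozenset(s1), frozenset(s2)
    universe = s1 | s2                       # assumption: closure over observed atoms only
    condition = s1 | {negate(literal) for literal in s2 - s1}
    add_list = cwa(s2, universe) - cwa(s1, universe)
    remove_list = cwa(s1, universe) - cwa(s2, universe)
    return condition, action, add_list, remove_list


before = {("possessing", "w"), ("kind", "w", "weapon")}
after = before | {("wielding", "w")}

condition, act, add_list, remove_list = build_example(before, after, ("wield", "w"))
```

For this wield transformation the sketch yields the add-list containing wielding(w) and the remove-list containing its negation, in line with Example 1.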

The structure of rules is a generalization of
that of examples. A rule is composed of five
parts: a conjunction of literals, denoting the
condition part S, an atom Act denoting the
action executed, an add- and a remove-list
(Al, Rl), and a list of counters C storing information
about the correctness, age etc. of
the rule. For the moment, we ignore the counters.
We shall discuss them in detail in Section
4.3.

A rule $(S_r, Act_r, Al_r, Rl_r, C_r)$ matches a rule
$(S_t, Act_t, Al_t, Rl_t, C_t)$ (or an example) if and
only if there is a substitution $\theta$ such that
$S_r\theta \subseteq S_t$, $Act_r\theta = Act_t$,
$Al_r\theta = Al_t$ and $Rl_r\theta = Rl_t$.

Rules $(S, Act, Al, Rl, C)$ can be used for predicting
the resulting situation description $S_r$
when executing action $A$ in situation $S_0$, provided
that there is a substitution $\theta$ such that
$S\theta \subseteq S_0$ and $Act\,\theta = A$. The predicted
situation description is $S_r = (S_0 - Rl\,\theta) \cup Al\,\theta$.
Rules for the learning agent and the opponent are
represented using the same formalism. Actions
executed by an opponent have an extra argument,
the agent executing the action.
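
Prediction with a rule can be sketched as a search for a substitution followed by the set update given above. The encoding is an assumption of this sketch: literals are tuples, variables are strings starting with an upper-case letter, constants are lower-case strings or numbers, and counters are omitted.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def unify_literal(pattern, fact, theta):
    """Extend theta so that pattern instantiated by theta equals the ground fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if theta.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return theta


def find_substitution(conditions, situation, theta):
    """Backtracking search for theta such that every condition, instantiated, is in the situation."""
    if not conditions:
        return theta
    first, rest = conditions[0], conditions[1:]
    for fact in situation:
        extended = unify_literal(first, fact, theta)
        if extended is not None:
            result = find_substitution(rest, situation, extended)
            if result is not None:
                return result
    return None


def substitute(literal, theta):
    return (literal[0],) + tuple(theta.get(arg, arg) for arg in literal[1:])


def predict(rule, situation, action):
    """Return (S0 - Rl.theta) | Al.theta, or None if the rule does not apply."""
    conditions, act, add_list, remove_list = rule
    theta = unify_literal(act, action, {})
    if theta is None:
        return None
    theta = find_substitution(list(conditions), situation, theta)
    if theta is None:
        return None
    removed = {substitute(literal, theta) for literal in remove_list}
    added = {substitute(literal, theta) for literal in add_list}
    return (set(situation) - removed) | added


wield_rule = (
    [("not_wielding", "W"), ("possessing", "W"), ("kind", "W", "weapon")],
    ("wield", "W"),
    [("wielding", "W")],
    [("not_wielding", "W")],
)
s0 = {("not_wielding", "w"), ("possessing", "w"),
      ("kind", "w", "weapon"), ("my_energy", 500)}

predicted = predict(wield_rule, s0, ("wield", "w"))
```

The same substitution search also covers the matching test between a rule and an example, once the add- and remove-lists are compared under the substitution found.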

Two rules $(S_r, Act_r, Al_r, Rl_r, C_r)$ and
$(S_t, Act_t, Al_t, Rl_t, C_t)$ are similar if and only if
$Act_r$ and $Act_t$ are similar, and ($Al_r$ and $Al_t$)
and ($Rl_r$ and $Rl_t$) are similar. Two literals are
similar if and only if they have the same predicate
symbol and sign. Two lists of literals are
similar if and only if each literal of the first
list is similar to a literal of the second one,
and vice versa.
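
The similarity test translates almost directly into code. In this sketch a literal is assumed to be a triple (sign, predicate, arguments) and a rule a dictionary with the keys act, add and remove; both encodings are choices made for the illustration only.

```python
def similar_literals(a, b):
    """Two literals are similar if they have the same sign and predicate symbol."""
    return a[0] == b[0] and a[1] == b[1]


def similar_lists(xs, ys):
    """Each literal of one list must be similar to some literal of the other, and vice versa."""
    def covers(left, right):
        return all(any(similar_literals(l, r) for r in right) for l in left)
    return covers(xs, ys) and covers(ys, xs)


def similar_rules(rule_r, rule_t):
    """Conditions and counters play no role in similarity."""
    return (similar_literals(rule_r["act"], rule_t["act"])
            and similar_lists(rule_r["add"], rule_t["add"])
            and similar_lists(rule_r["remove"], rule_t["remove"]))


wield_dagger = {"act": (True, "wield", ("w1",)),
                "add": [(True, "wielding", ("w1",))],
                "remove": [(False, "wielding", ("w1",))]}
wield_sword = {"act": (True, "wield", ("w2",)),
               "add": [(True, "wielding", ("w2",))],
               "remove": [(False, "wielding", ("w2",))]}
drop_sword = {"act": (True, "drop", ("w2",)),
              "add": [(True, "item_on", (0, "w2"))],
              "remove": [(True, "possessing", ("w2",))]}

same_class = similar_rules(wield_dagger, wield_sword)
different_class = similar_rules(wield_dagger, drop_sword)
```

Because arguments are ignored, similarity is a coarse grouping criterion: it decides which class a rule belongs to, while matching decides whether a rule actually covers an example.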


Example  1  :  Situations,  examples,  and 
rules 

A situation description is, for instance:

agent_on(1,d), item_on(2,s), obstacle_on(3),
possessing(w), kind_of_agent(d,dragon),
kind(s,spell), kind(w,weapon),
kind_of_weapon(w,handweapon),
kind_of_hand_weapon(w,dagger),
my_energy(500), my_gold(0), alive, uncursed,
my_kind(human).

When executing the action wield(w), the resulting
situation description would be:

wielding(w), agent_on(1,d), item_on(2,s),
obstacle_on(3), possessing(w),
kind_of_agent(d,dragon),
kind(s,spell), kind(w,weapon),
kind_of_weapon(w,handweapon),
kind_of_hand_weapon(w,dagger),
my_energy(500), my_gold(0), alive, uncursed,
my_kind(human).

The example constructed from this transformation
would then be (under the assumption
that the irrelevant literals for wield are the ones
containing information about the neighbouring
positions of the agent):

S = not_wielding(w), possessing(w),
kind(w,weapon),
kind_of_weapon(w,handweapon),
kind_of_hand_weapon(w,dagger),
my_energy(500), my_gold(0), alive, uncursed,
my_kind(human);
Act = wield(w);
Al = wielding(w);
Rl = not_wielding(w).

Applying  this  rule  to  the  original  situation  de¬ 
scription  indeed  results  in  the  transformed  sit¬ 
uation  description.  Real-life  situation  descrip¬ 
tions  of  game  states  are  usually  more  compli¬ 
cated  as  more  objects  and  opponents  are  in¬ 
volved.  0 
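As a minimal sketch of how such a rule transforms a situation, assume that Al is an add list and Rl a remove list, and that a situation is a set of ground literals written as strings. The function name `apply_rule` and this string encoding are illustrative assumptions, not part of the described system.

```python
def apply_rule(situation, add_list, remove_list):
    """Delete the literals of the remove list, then add those of the add list."""
    return (frozenset(situation) - frozenset(remove_list)) | frozenset(add_list)


# Situation description of Example 1 before executing wield(w).
before = frozenset({
    "agent_on(1,d)", "item_on(2,s)", "obstacle_on(3)", "possessing(w)",
    "kind_of_agent(d,dragon)", "kind(s,spell)", "kind(w,weapon)",
    "kind_of_weapon(w,handweapon)", "kind_of_hand_weapon(w,dagger)",
    "my_energy(500)", "my_gold(0)", "alive", "uncursed", "my_kind(human)",
})

after = apply_rule(before,
                   add_list={"wielding(w)"},
                   remove_list={"not_wielding(w)"})
```

Here `after` differs from `before` only by the added literal wielding(w), which is the transformed situation description given in the example.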

4.2  Rule  generation 

Rules in the knowledge base are stored in different classes of similar
rules. Each class of rules is organized in a binary tree, in which a
parent rule is connected to two children if the parent is the
generalization of the children. The most general rule of each class
can thus be found at the top of the corresponding tree.

The knowledge base is modified for each incoming example. If there is
no rule (class) in the knowledge base similar to the example, a new
class of rules containing only the example is added to the knowledge
base. If there is a class of rules similar to the example, the
corresponding tree is searched for a rule matching the example and all
counters of rules matching the example are updated (see below). If no
rule matching the example exists, the generalization of the top rule
in the tree and the example is computed and added to the tree. The
generalization thus has two children: the top rule of the original
binary tree and the example. Furthermore, the tree may be reorganized
according to the knowledge base management principles outlined in
Section 4.3.

The generalization of a rule $(S_r, Act_r, Al_r, Rl_r, C_r)$ and an
example $(S_e, Act_e, Al_e, Rl_e)$ is based on Plotkin's well-known
lgg-operator (see [Plotkin, 1970]). The generalized rule
$(S_g, Act_g, Al_g, Rl_g, C_g)$ is computed as follows (the actual
implementation of this algorithm is described in detail in [Bleken,
1992]):

• compute $S_g \to Act_g$ = Plotkin's
  $lgg(S_r \to Act_r, S_e \to Act_e)$; this defines $S_g$ and $Act_g$;

• determine $\theta_r$ and $\theta_e$ such that
  1) $(S_g \to Act_g)\theta_r = S_r \to Act_r$ and
  2) $(S_g \to Act_g)\theta_e = S_e \to Act_e$;

• determine $\theta_r^{-1}$ and $\theta_e^{-1}$ such that
  3) $Rl_r\theta_r^{-1} \subseteq S_g$ and
  4) $Rl_e\theta_e^{-1} \subseteq S_g$ and
  5) $Rl_r\theta_r^{-1} = Rl_e\theta_e^{-1}$;
  conditions 3) to 5) are needed for constructing meaningful
  generalizations because inverse substitutions are not necessarily
  unique;

• $Al_g := lgg(Al_r\theta_r^{-1}, Al_e\theta_e^{-1})$ and
  $Rl_g := Rl_e\theta_e^{-1}$;

• reduce $S_g$ as much as possible without deleting literals from
  $Rl_g$ using Plotkin's reduction algorithm.

Because the generalization algorithm is quite complex and relies on
some concepts from inductive logic programming such as inverse
substitutions [Muggleton and Buntine, 1988] and least general
generalization [Plotkin, 1970], we illustrate it by Example 2.

Example 2 : Generalizing rules

Consider the following (simplified) example e and rule r. The example
starts from a state where there is a bat on position 5 and a human on
position 7. When hitting the bat (with bare hands), the bat dies
(disappears) and the bat's weapon (an axe) is found on position 5.
The meaning of the rule is similar.

S_e = agent_on(5,b), kind(b,Bat), agent_on(7,h),
      kind(h,Human), not_item_on(5,a);

S_r = agent_on(7,d), kind(d,Dragon),
      not_item_on(7,s);

Act_e = hit(5);

Act_r = hit(7);

Rl_e = agent_on(5,b), kind(b,Bat),
       not_item_on(5,a);

Rl_r = agent_on(7,d), kind(d,Dragon),
       not_item_on(7,s);

Al_e = kind(a,Axe), item_on(5,a);

Al_r = kind(s,Sword), item_on(7,s);

This results in generalizing the following literals together
(following Plotkin):

lgg(agent_on(5,b), agent_on(7,d)) = agent_on(P,A)
lgg(kind(b,Bat), kind(d,Dragon)) = kind(A,K1)
lgg(not_item_on(5,a), not_item_on(7,s)) = not_item_on(P,I)
lgg(agent_on(7,h), agent_on(7,d)) = agent_on(7,A2)
lgg(kind(h,Human), kind(d,Dragon)) = kind(A2,K2)
lgg(hit(5), hit(7)) = hit(P)

yielding:

S_g (non-reduced) = agent_on(P,A), kind(A,K1),
      not_item_on(P,I), agent_on(7,A2), kind(A2,K2);

Act_g = hit(P);

θ_e = { P = 5, A = b, K1 = Bat, I = a, A2 = h, K2 = Human }

θ_r = { P = 7, A = d, K1 = Dragon, I = s, A2 = d, K2 = Dragon }

θ_e^{-1} is unique (no term appears twice on the right hand side of
the equations); the inverse substitution θ_r^{-1} satisfying the
requirements is:

{ 7 → P, d → A, Dragon → K1, s → I }

which results in:

Rl_g = agent_on(P,A), kind(A,K1),
       not_item_on(P,I);

Al_g = lgg((kind(I,Axe), item_on(P,I)),
           (kind(I,Sword), item_on(P,I))) = kind(I,K3),
       item_on(P,I).

The reduced S_g = agent_on(P,A), kind(A,K1), not_item_on(P,I), which
yields a meaningful generalization. Notice that the rule g matches
the example and the rule it was generalized from. □
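The pairwise generalization step of Example 2 can be reproduced with a small anti-unification sketch. It is a simplification under stated assumptions: atoms are flat tuples (predicate, arg1, ...) without nested function terms, generated variables are named V0, V1, ... (so no constant may carry such a name), and the inverse substitutions and Plotkin's reduction step are not covered. The names `lgg_atom` and `lgg_literals` are hypothetical.

```python
def lgg_atom(a, b, table):
    """Least general generalization of two flat atoms.

    Returns None when the atoms are incompatible (different predicate
    symbol or arity). `table` maps a pair of differing terms to the
    variable that replaces it, so the same pair always yields the same
    variable across all literals.
    """
    if a[0] != b[0] or len(a) != len(b):
        return None
    args = []
    for x, y in zip(a[1:], b[1:]):
        if x == y:
            args.append(x)
        else:
            args.append(table.setdefault((x, y), f"V{len(table)}"))
    return (a[0], *args)


def lgg_literals(xs, ys, table=None):
    """Generalize every compatible pair of literals from two lists."""
    table = {} if table is None else table
    result = []
    for a in xs:
        for b in ys:
            g = lgg_atom(a, b, table)
            if g is not None and g not in result:
                result.append(g)
    return result, table


s_e = [("agent_on", 5, "b"), ("kind", "b", "Bat"), ("agent_on", 7, "h"),
       ("kind", "h", "Human"), ("not_item_on", 5, "a")]
s_r = [("agent_on", 7, "d"), ("kind", "d", "Dragon"), ("not_item_on", 7, "s")]

s_g, table = lgg_literals(s_e, s_r)
act_g = lgg_atom(("hit", 5), ("hit", 7), table)   # shares the variable for (5, 7)
```

With V0, V1, V2, V5, V3, V4 read as P, A, K1, I, A2, K2, the list `s_g` corresponds to the non-reduced S_g of the example, and `act_g` to hit(P).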


4.3  Knowledge  base  management 

In the previous section, we discussed how the agent generalizes rules
from experience. Here, we present the management principles of the
rule base. The aim of knowledge base management is to memorize only
the most interesting rules and to forget the other ones. Forgetting
uninteresting rules is necessary for efficiency. Therefore, the
knowledge base manager keeps track of a number of counters for each
rule:

• tested (T): contains the number of similar actions executed since
  the rule was generated;

• applicable (A): contains the number of times the situation part of
  the rule matched the given situation and the action, executed in
  the given situation, was similar to the action of the rule;

• correctness (C): contains the number of times the rule's prediction
  was correct when it was applicable.

The T counter encodes a kind of age of the rule (the number of times
it could have been used), and the probability that the result will be
correctly predicted by the rule (when the current situation matches
the condition part of the rule and the corresponding action is
executed) is C/A. Using these counters, it is easy to define criteria
for accepting a rule as promising (and deleting its children from the
binary tree) and for deleting a rule. Rules that have been around
for a long time (i.e. with a large T counter) and that are seldom
applicable (i.e. with a small A counter) are not interesting as they
have a low probability of being used. Also, rules that have a low
probability of being correct (i.e. with very low C/A) can best be
forgotten. Furthermore, when a rule has been proven to be correct a
number of times, its children are forgotten. These principles have
been implemented in the system (cf. [Bleken, 1992; Swennen, 1991;
Chaouat, 1991]) and have proven to result in small knowledge bases
containing useful rules.
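The three counters and the forgetting criteria can be sketched as follows. The text does not give concrete thresholds, so every numeric default below (`min_age`, `min_use`, `min_accuracy`, `min_correct`) is a hypothetical value chosen for illustration, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    tested: int = 0      # T: similar actions executed since the rule was generated
    applicable: int = 0  # A: times the rule matched situation and action
    correct: int = 0     # C: times the prediction was correct when applicable

    def record(self, applicable: bool, correct: bool = False) -> None:
        """Update the counters after a similar action has been executed."""
        self.tested += 1
        if applicable:
            self.applicable += 1
            if correct:
                self.correct += 1

    def accuracy(self):
        """Estimated probability C/A, or None if the rule was never applicable."""
        return self.correct / self.applicable if self.applicable else None


def should_forget(s: RuleStats, min_age=50, min_use=0.05, min_accuracy=0.2) -> bool:
    """Old rules that are seldom applicable or seldom correct are forgotten."""
    if s.tested < min_age:          # too young to judge (assumed policy)
        return False
    if s.applicable / s.tested < min_use:
        return True
    acc = s.accuracy()
    return acc is not None and acc < min_accuracy


def is_promising(s: RuleStats, min_correct=10, min_accuracy=0.9) -> bool:
    """A rule proven correct often enough; its children may then be deleted."""
    acc = s.accuracy()
    return s.correct >= min_correct and acc is not None and acc >= min_accuracy
```

Waiting until a rule has reached a minimum age before judging it is a design choice of this sketch; it prevents a freshly generated rule from being discarded after a single unlucky prediction.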

5  Learning  the  weights  of  an 
evaluation  function 

The only direct means to evaluate the performance of the agent is its
current score, which is the value of the static evaluation function
$Stat(S) = w_1 \times Energy(S) + w_2 \times Gold(S)$ (the weights are
known to the agent and $w_2$ is much smaller than $w_1$) in the
current situation $S$. Notice that the static evaluation function is
not in an operational form, in the sense that it does not say anything
about the relevance of possessing items, of killing opponents, of
quaffing potions, etc. When planning for the next action to execute
(cf. next section), the agent performs a variant of minimax search
using the rules in the knowledge base and an evaluation function. If
$Stat$ is the evaluation function being used, the agent will have to
consider a very deep search tree, which is undesirable: it requires a
lot of computation and, in the particular context of the game, the
leaves become much more uncertain (moving outside the visible part of
the board always results in uncertainty as the agent does not know the
characteristics of positions outside its view). Therefore $Stat$ is
not suited as an evaluation function for a planner. Instead, we would
like to have an operational evaluation function $Op$ that evaluates
the current situation in terms of directly observable features such as
the number and kinds of opponents, of (worn and unworn) armors, of
(wielded and unwielded) weapons, of wands, of food, etc. A suitable
format for such an operational evaluation function would be
$Op(S) = \sum_i w_i \times F_i(S)$. In the implemented adventure game,
we employ such a function where the numerical features $F_i$ are
similar to the ones discussed above. If the operational function is to
be relevant to the ultimate goal of the system, as incorporated in the
static evaluation function, the two functions should be related.
Evaluating $Op(S_1)$ in a situation $S_1$ should tell us something
about $Stat(S_n)$, where $S_n$ is the n-th situation after $S_1$.^1
Ideally, we should have $Stat(S_n) = Op(S_1)$.^2


^1 The algorithm stores the n previous situations and uses the oldest
one as $S_1$ and the most recent one as $S_n$, n being a parameter of
the system.

^2 At this point, the reader might believe that n should be equal to
2. Whereas an approach where n = 2 could be followed, many important
side effects of actions are not immediately visible. E.g. when wearing
an armor, the immediate effect is that the armor is being worn, but
the side effect that the agent is now better protected is only noticed
later, when the agent is attacked by its opponents. Therefore it is
more desirable to take a slightly larger n, e.g. n = 10. Too large
values for n should be avoided as the effects of the actions disappear
when waiting too long.

In general this equation will not be satisfied. When it is not
satisfied it is desirable to modify the operational evaluation
function, such that it better approximates the static one. Therefore
we define the updated evaluation function $uOp$:

$$uOp(S_1) = \sum_i w_i \times F_i(S_1)
           + \sum_i d_i \times (F_i(S_n) - F_i(S_1)) = Stat(S_n)$$

The updated function assumes that the error of the old operational
function is due to the parameters that changed in between situation
$S_1$ and $S_n$. The updated function can be used as the basis of an
evaluation function learner. Indeed, we can now compute the updated
weights $w'_i = w_i + d_i$ as follows:

$$Stat(S_n) = uOp(S_1)
            = \sum_i w_i \times F_i(S_1)
              + \sum_i d_i \times (F_i(S_n) - F_i(S_1))
            = Op(S_1) + \sum_i d_i \times (F_i(S_n) - F_i(S_1))
  \quad (1)$$

As the equation is underdetermined for the $d_i$, we have to
approximate the $d_i$ making certain assumptions:

• the $d_i$ for which $F_i(S_n) - F_i(S_1) = 0$ are assumed to be 0,
  reflecting the principle that the weights of features should not be
  changed without evidence; and

• all non-null features are assumed to have equal impact on the error;
  therefore $d_i \times (F_i(S_n) - F_i(S_1)) = Ct$ (2) is assumed to
  hold for a constant $Ct$ for these features.

Under these assumptions we have:

$$Ct = \frac{\sum_j d_j \times (F_j(S_n) - F_j(S_1))}{m}$$

where j ranges over the m features for which
$F_j(S_n) - F_j(S_1) \neq 0$. Together with equations (1-2), this
yields:

$$Ct = \frac{Stat(S_n) - Op(S_1)}{m}$$

$$d_i = \frac{Stat(S_n) - Op(S_1)}{m \times (F_i(S_n) - F_i(S_1))}$$

Our (preliminary) experiments have shown that applying this method as
it stands has two minor problems:

• the weights oscillate frequently and with large values; this can be
  avoided by replacing m by $m \times f$, where f is a user-defined
  parameter;

• the weights grow steadily and can become very large; therefore it is
  more appropriate to normalize them after each update.

Using these two changes, the evaluation function learner learns rather
quickly. Given 9 features, random play, and initial weights of 0
(except for energy and gold), the learning module learns weights with
a correct sign (i.e. the direction of the influence of the feature is
correct) in less than 300 cycles.
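The update rule and its two refinements can be sketched as below. The update itself follows the formula for the corrections $d_i$ with m replaced by m × f; the normalization scheme is not specified in the text, so dividing by the largest absolute weight is purely an assumption of this sketch, as are the function and parameter names.

```python
def update_weights(weights, f_old, f_new, stat_new, damping=1.0):
    """One learning step: w'_i = w_i + d_i.

    weights  -- current weights w_i of Op
    f_old    -- feature values F_i(S_1) of the oldest stored situation
    f_new    -- feature values F_i(S_n) of the most recent situation
    stat_new -- Stat(S_n)
    damping  -- the user-defined parameter f (1.0 gives the undamped rule)
    """
    op_old = sum(w * f for w, f in zip(weights, f_old))      # Op(S_1)
    error = stat_new - op_old                                # Stat(S_n) - Op(S_1)
    deltas = [fn - fo for fo, fn in zip(f_old, f_new)]
    m = sum(1 for d in deltas if d != 0)                     # number of changed features
    if m == 0:
        return list(weights)                                 # no evidence, no change
    return [w + (error / (m * damping * d) if d != 0 else 0.0)
            for w, d in zip(weights, deltas)]


def normalize(weights):
    """Rescale so that the largest absolute weight is 1 (assumed scheme)."""
    scale = max(abs(w) for w in weights)
    return [w / scale for w in weights] if scale else list(weights)


# Three features, two of which changed between S_1 and S_n.
new_w = update_weights([1, 0, 0], f_old=[10, 0, 2], f_new=[12, 1, 2], stat_new=20)
```

In this example Op(S_1) = 10 and the error is 10; it is split equally over the two changed features, giving corrections 2.5 and 5, so `new_w` is [3.5, 5.0, 0.0].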

Again, we wish to stress here that the only knowledge being used is
similar to that possessed by novice human players: the features that
potentially influence the goal of the game. Furthermore, these
features are derived almost directly from the knowledge
representation. One possible refinement, which we are planning to
investigate, is to have a two-layered evaluation function, each layer
having a format similar to that above; the outer layer would contain
the aggregated features such as the number of armors, opponents and
weapons, whereas the inner layer would divide each aggregated feature
into its components. For armors, this would include the number of
helmets, of pairs of gloves, of boots, of harnesses and of coats.


6  Planning  to  survive 

Given the evaluation function $uOp$ of the previous section, the
knowledge base containing rules and some objectives, we can use a
variant of minimax search to select the most interesting action to
execute in a given situation. In adventure games, there are two
distinct primary objectives: survival and learning. The most important
objective is definitely to survive as long as possible. However, human
players also tend to experiment with several actions, when there are
few risks involved, in order to explore the environment and enhance
their understanding of the game. Therefore, an intelligent learning
agent should also follow this strategy.

In order to plan, the agent starts from a situation description $S_0$
and a number of actions $A_1, ..., A_n$ that are executable, i.e.
legal, in $S_0$. Let us assume that there are rules $R_1, ..., R_k$
(for each class of rules, only the most general rule considered to be
correct is used for prediction) whose action is similar to one of the
actions $A_1, ..., A_j$. (For the actions $A_{j+1}, ..., A_n$ there
are no rules in the knowledge base.) Using these rules to predict the
outcome of executing action $A_i$ in situation $S_0$ results in a
number of situation descriptions $S_{a_1,1}, ..., S_{a_1,n_1}$ when
executing action $A_1$; ... ; in situation descriptions
$S_{a_j,1}, ..., S_{a_j,n_j}$ when executing action $A_j$. Here, $n_1$
($n_j$) is the number of predicted situations when executing action
$A_1$ ($A_j$). Furthermore, all situations $S_{a_i,t}$ have an
associated probability $p_{i,t}$, defined as the ratio C/A of the rule
used in the prediction. Because the sum $\sum_t p_{i,t}$ of these
probabilities is not necessarily equal to 1, we normalize them into
likelihoods $l_{i,t}$. The estimated value of action $A_i$ is then
$E(A_i) = \sum_t l_{i,t} \times E(S_{a_i,t})$. The best action $A_b$
to be executed in situation $S_0$ for the learning agent is that with
a maximum $E(A_i)$; therefore $E(S_0) = \max_i E(A_i)$. The values of
the resulting situation descriptions, the $E(S_{a_i,t})$, are computed
similarly. The only differences are that:

• We do not assume that the opponents choose the action with a maximum
  value for the learning agent, nor the action with a minimum value,
  as either of these options would imply that the opponents are aware
  of the aims of our learning agent and want to help the agent achieve
  these aims or to make it fail. Rather we assume the opponents act
  independently of the learning agent and therefore we take the
  average.

• All of the opponents on the board can execute an action, which
  affects the current situation. Since we assume all opponents act
  independently, we define the value of a situation S as the average
  of the values of situations resulting from the actions of the
  opponents.^3

More formally, $E(S_{a_i,t}) = \frac{1}{n} \sum_{l=1}^{n} E(O_l)$,
where there are n opponents $O_l$ able to execute an action in
situation $S_{a_i,t}$; and
$E(O_l) = \frac{1}{m_l} \sum_{k=1}^{m_l} E(A_{k,O_l})$, where there
are $m_l$ possible actions $A_{k,O_l}$ in situation $S_{a_i,t}$ that
can be executed by agent $O_l$. $E(A_{k,O_l})$ is then computed as for
$E(A_i)$ above. Obviously the recursion should terminate at a certain
level; this is done by assigning, at that level, $E(S) = uOp(S)$ for
all situations S occurring at that level. Also, situations with a
likelihood less than a user-defined parameter are not elaborated
further.

The computation of the value of situation $S_0$ is summarized as
follows (see also Figure 3):

• $E(S_0)$ is the maximum value in the set of actions $A_i$ executable
  by the agent in situation $S_0$;

• the value of an agent's action $A_i$ is the (likelihood-weighted)
  average value of the resulting situations $S_{a_i,t}$, predicted by
  rules in the knowledge base;

• the value of a situation $S_{a_i,t}$ is the average value of the
  opponents $O_l$ in that situation;

• the value of an opponent $O_l$ is the average value of the actions
  it can execute;

• the value of an action of an opponent is the average value of the
  resulting situations.

^3 This assumption is not realistic (cf. Section 7).


[Figure 3 diagram: levels of the search tree, from top to bottom:
Situation S; Agent's Actions A; Agent's Rules R for A; Predicted
Situations S'; Opponents O; Actions A' of Opponent O; Agent's Rules R'
for Opponent O's Actions A'; Predicted Situations S''.]

Figure 3: Structure of Search
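The recursive value computation summarized above can be sketched as follows. The game model `m` is a hypothetical interface introduced only for this illustration: `agent_actions(s)` returns the agent's actions covered by rules, `opponents(s)` and `opponent_actions(s, o)` enumerate opponents and their actions, `predict(s, a, actor)` returns (situation, C/A) pairs (assumed non-empty for covered actions), and `leaf(s)` plays the role of uOp. The likelihood cut-off is omitted for brevity.

```python
def normalize(ps):
    """Turn rule probabilities C/A into likelihoods that sum to 1."""
    total = sum(ps)
    return [p / total for p in ps] if total > 0 else [0.0] * len(ps)


def expected(outcomes, value_of):
    """Likelihood-weighted value of a list of (situation, probability) pairs."""
    likelihoods = normalize([p for _, p in outcomes])
    return sum(l * value_of(s) for (s, _), l in zip(outcomes, likelihoods))


def situation_value(s, depth, m):
    """E(S) when the learning agent is to move: maximum over its actions."""
    actions = m.agent_actions(s)
    if depth == 0 or not actions:
        return m.leaf(s)
    return max(expected(m.predict(s, a, None),
                        lambda t: after_agent(t, depth, m))
               for a in actions)


def after_agent(s, depth, m):
    """Value of a predicted situation: average over opponents, each being
    the average over the actions that opponent can execute."""
    per_opponent = []
    for o in m.opponents(s):
        acts = m.opponent_actions(s, o)
        if acts:
            values = [expected(m.predict(s, a, o),
                               lambda t: situation_value(t, depth - 1, m))
                      for a in acts]
            per_opponent.append(sum(values) / len(values))
    if not per_opponent:                 # nobody else can act
        return situation_value(s, depth - 1, m)
    return sum(per_opponent) / len(per_opponent)
```

One level of `depth` corresponds to one cycle of the game: an action of the agent followed by the independent reactions of the opponents.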

The planning phase discussed above started from actions
$A_1, ..., A_n$, for which we had rules predicting the outcome of the
actions $A_1, ..., A_j$. Planning only estimates
$E(A_1), ..., E(A_j)$; it does not say anything about
$E(A_{j+1}), ..., E(A_n)$. Furthermore, because $uOp$ is used to
evaluate the leaves of the search tree, it only takes into account the
survival goal (i.e. the aim of the game). To evaluate the
interestingness of executing an action $A_i$ with regard to the
learning goal, we use the uncertainty of an action (similar to [Scott
and Markovitch, 1989] and to the information content frequently used
in TDIDT algorithms [Quinlan, 1986]):
$U(A_i) = -\sum_t l_{i,t} \times \log(l_{i,t})$. The higher the
uncertainty of an action, the more interesting it is to execute.

Using these principles, planning for survival and learning is
coordinated as follows: if the energy level of the agent is below a
user-defined critical value, then the agent selects the best action
according to the survival objective (i.e. the action A with highest
$E(A)$); else if the best action B according to the learning objective
has an uncertainty higher than a user-defined critical value, the
action B is executed; otherwise an action for which nothing is known
is selected (i.e. one of the $A_{j+1}, ..., A_n$). Within the above
framework, it is straightforward to define alternative strategies
(e.g. maximizing survival).
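The coordination of the two objectives can be sketched as follows, assuming that the uncertainty of an action is the entropy of its normalized likelihoods (base-2 logarithm chosen here for concreteness). The function names, the dictionary layout, and the fallback used when no unknown action is left are assumptions of this sketch; the text does not specify that last case.

```python
import math


def uncertainty(likelihoods):
    """Entropy of the predicted outcomes of one action (assumed base 2)."""
    return -sum(l * math.log2(l) for l in likelihoods if l > 0)


def choose_action(energy, known, unknown, critical_energy, critical_uncertainty):
    """Select an action for survival or for learning.

    known   -- dict: action -> (E, U), estimated value and uncertainty
               for the actions covered by rules
    unknown -- list of legal actions for which no rule exists
    """
    if known and energy < critical_energy:
        # survival objective: highest estimated value E(A)
        return max(known, key=lambda a: known[a][0])
    if known:
        best = max(known, key=lambda a: known[a][1])
        if known[best][1] > critical_uncertainty:
            return best                      # learning objective
    if unknown:
        return unknown[0]                    # try something nothing is known about
    # assumed fallback: nothing left to explore, so play for survival
    return max(known, key=lambda a: known[a][0]) if known else None


known = {"eat": (5.0, 0.1), "hit": (2.0, 0.9)}
low_energy_choice = choose_action(10, known, ["zap"], 50, 0.5)     # "eat"
exploring_choice = choose_action(100, known, ["zap"], 50, 0.5)     # "hit"
```

Raising `critical_uncertainty` above 0.9 in the second call would make the agent try the unknown action "zap" instead.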

7  Preliminary  experiments 

The current state of the system is such that all individual components
are known to work well and there is some evidence that the system as a
whole functions well:

• the rule generator generates useful rules;

• using an autonomous rule generator and knowledge base management
  results in a small number of good rules being stored; in one of the
  experiments (see [Bleken, 1992]), rules for 6 actions were learned
  from 500 examples; in total about 200 generalizations were computed
  and the final rule base contained only 10 rules (this means that 690
  rules were discarded), of which 2 were examples and the other 8
  general and correct rules;

• the evaluation function learner quickly learns evaluation functions
  that make sense;

• when using good rules, the planner selects appropriate actions,
  leading to increased energy levels and gold pieces owned and to
  agents that live longer (in random play, the agent dies quickly,
  after about 30 cycles on average).

Also, the learning system is efficient: to learn and plan, the system
(implemented on a SUN SPARC using ProLog by BIM) takes a few seconds
for each cycle; this time stays approximately constant, also when the
knowledge base contains more rules. These preliminary experiments are
described in detail in [Swennen, 1991; Chaouat, 1991; Bleken, 1992;
Coget, 1993].

Furthermore, in some preliminary experiments, we were able to show
that the system indeed improves its behaviour over time and learns to
survive. In these experiments, we started from an empty knowledge
base, an evaluation function with weights initialized to 0, an f value
of 200, and a minimax tree that predicted one cycle of the game (the
effect resulting from the system agent's action and the situations
resulting from that by predicting the opponents' actions and effects).
The results are shown in Figure 4, and placed in context in Figure 5.
In Figure 4, the thin line denotes the score for each of the 11 games.
The average score (thin line) of the last games improves slowly but
steadily. This score is plotted in Figure 5 against the average
performance by random play, by novice humans and by expert humans (the
designers of the game). Using our multistrategy learning system, the
system already performs better than novice human players.
Nevertheless, even after further training, the average score never
exceeds 800-850. Experts perform better. We believe that this is
mainly because of some current shortcomings of the system:

• The inability to plan long and complicated sequences of actions; in
  Hack-type games it is often desirable to pursue a kind of abstract
  plan (e.g. move towards food, escape from your opponents, search for
  a particular item, etc.), which may require many actions before the
  plan is satisfied. Currently the system is shortsighted as it can
  only look ahead a few steps.

• The inability to compute the combined effects of the opponents. At
  the moment, the planner computes the average of the situations
  resulting from the individual actions of opponents, thereby ignoring
  their combined effects. Two possibilities to avoid the problem are
  1) learning more complicated rules that take into account all agents
  on the board, and 2) rather than computing the effects of the
  opponents in parallel, propagating the effects of the opponents by
  using sequential prediction.

Figure 4: results of learning

Figure 5: comparison with other results

8 Conclusions and Related Work

We have argued that learning to survive and act in the context of an
adventure game is a challenging and realistic task for multistrategy
learning, having many interesting features in common with the real
world. We have also outlined an architecture and the components of an
autonomous learning agent in such environments. Preliminary
experiments indicate that the agent performs quite well. We believe
the main reason for this is the integration of a number of learning
strategies into a working whole. In particular, our agent uses
empirical learning from examples (by applying inductive logic
programming techniques) to generate rules, a knowledge base manager to
learn from experience, and a control module that learns an evaluation
function.

To the best of our knowledge, the application of multistrategy
learning to learn how to survive is new. Nevertheless, the learning
strategies are related to other work. More specifically, the rule
generator relies on inductive logic programming [Muggleton, 1992;
De Raedt, 1992; Plotkin, 1970] techniques that work from specific to
general; the evaluation function learner addresses a problem related
to that of credit assignment in the bucket brigade algorithm of
[Holland, 1986]; it is also related to the early work of Arthur Samuel
[Samuel, 1967]; the knowledge base manager is related to recent
approaches in explanation-based learning that remember only the most
promising rules; and the integration of planning with learning is
related to reinforcement learning.

Appendix 1: the game


List of opponents: Humans [H], Dragons [D] and Bats [B].

List of objects:

• Gold: needed to become rich.

• Food [%]: needed to increase the energy level; consuming cursed
  food decreases the energy level, however.

• Potions: quaffing potions adds or removes curses from an object of
  the same type.

• Armors: protect against attacks by opponents.

• Handweapons: wielded handweapons are needed to hit the opponents.

• Wands: change the current situation of the game; wands of death
  kill the agent at which they are zapped, wands of teleport change
  the current situation randomly (except for obstacles), wands of
  polymorph change the type of object or opponent at which they are
  zapped.

• Spells [?]: change the current situation of the game; killing
  spells have the same effect as hitting all opponents
  simultaneously; teleport spells change the current situation
  randomly; and cursing spells add or remove curses from objects.

List of actions:

• move( position ): move to a neighbouring position which is not an
  obstacle or opponent.

• pick_up( item ): pick up an item on the current position.

• drop( item ): drop an item currently possessed.

• eat( food ): eat food currently possessed.

• hit( position ): hit an opponent on a neighbouring position. The
  hitting agent loses energy and the opponent may but need not lose
  energy.

• wield( handweapon ): wield a handweapon currently possessed.

• release( handweapon ): release a handweapon currently possessed.

• wear( armor ): wear an armor currently possessed.

• take_off( armor ): take off an armor currently worn.

• cast( spell ): cast a spell currently possessed. Spells can be cast
  only once.

• quaff( potion ): quaff a potion currently possessed. Potions can be
  quaffed only once.

• zap( wand ): zap a wand (currently possessed) to a neighbouring
  position. Wands can be zapped only once.

Appendix 2: representations of situations

The following predicates are used to describe situations, rules and
examples.

• Agent_on ( pos, agent );

• Item_on ( pos, item );

• Obstacle_on ( pos );

• Possessing ( item );

• Wielding ( item );

• Wearing ( item );

• Cursed_by ( item, curse ) (only for items possessed by learning
  agent);

• Un_Cursed ( item ) (only for items possessed by learning agent);

• Alive (learning agent is alive);

• Dead;

• My_Kind ( kind ) (learning agent is of type kind);

• My_Energy ( energy ) (energy level of learning agent);

• My_Gold ( gold );

• My_Curse ( curse ) (learning agent is cursed);

• Un_Cursed;

• Kind_of_Agent ( agent, kind );

• Kind ( item, kind ); possible kinds are: gold, edible, armor,
  spell, wand, weapon.

• Kind_of_Armor ( item, kind ); possible kinds are: boots, harness,
  helmet, coat, gloves.

• Kind_of_Weapon ( item, kind ); possible kinds are: hand-weapon,
  wand.

• Kind_of_Edible ( item, kind ); possible kinds are: food, potion.

• Kind_of_Hand_weapon ( item, kind ); possible kinds are: sword,
  dagger, axe.

• Kind_of_Wand ( item, kind ); possible kinds are:
  wand_of_polymorph, wand_of_teleport, wand_of_death.

• Name_of_spell ( item, name );

• Curse_of_potion ( item, curse );

• No_Agent_on ( pos );

• Not_Item_on ( pos, item );

• No_Obstacle_on ( pos );

• Not_Possessing ( item );

• Not_Wielding ( item );

• Not_Wearing_such_an_armor ( item );

• Not_Cursed_by ( item, curse );

• Not_My_Curse ( curse );

The literals starting with Not and No are the negations of the
corresponding positive literals. To reason about the energy and the
gold of the system agent, there are some elementary literals for
comparison such as less_than, equal_to and less_than_or_equal_to.

Abstract

This  paper  presents  MUSKRAT,  a  Multistrategy 
Knowledge  ReHnement  and  Acquisition 
Toolbox.  MUSKRAT  is  an  open  architecture 
which  supports  the  integration  of  problem 
solvers  and  various  types  of  knowledge  ac¬ 
quisition  tools,  including  knowledge  elicita¬ 
tion,  machine  learning,  and  knowledge  base 
refinement  tools.  All  the  knowledge  acquired 
is  expressed  in  a  Common  Knowledge 
Representation  Language  (CKRL),  and  can  be 
shared  by  several  problem  solvers;  each  tool 
translates  its  internal  knowledge  representa¬ 
tion  formalism  to  or  from  CKRL.  An  advice¬ 
giving  system  compares  the  requirements  of 
the  selected  problem  solver  with  available 
sources  of  information  (knowledge,  data,  hu¬ 
man  expert..)  and  recommends  one  or  more 
knowledge  acquisition  tools,  based  on  a 
knowledge-level  description  of  each  tool.  We 
describe  the  MUSKRAT  architecture,  and  illus¬ 
trate  it  with  a  detailed  description  of  a  proto¬ 
type  currently  being  implemented,  which  in¬ 
cludes  three  problem  solvers  and  four  knowl¬ 
edge  acquisition  tools. 


Key  words:  Knowledge  acquisition,
integrated  system,  machine  learning,
knowledge  elicitation,  knowledge  base
refinement.

1  Introduction 

Research  into  knowledge-based  systems  orig¬ 
inally  focused  on  building  inference  engines. 
It  then  became  progressively  clear  that  the 
most  significant  bottleneck  was  not  in  the  in¬ 
ference  engine  but  in  the  acquisition  of 
knowledge.  After  considerable  experience  of
carrying  out  this  process,  one  valuable  insight
was  that  knowledge-based  systems  which  attempt
to  address  the  same  sort  of  task  have
much  in  common,  and  that  once  the  type  of 
problem  solver  was  determined  it  was  much 
easier  to  decide  what  domain  knowledge  was 
required.  Several  researchers  have  attempted 
to  build  taxonomies  of  problem  solving 
(Clancey,  1985;  Hayes-Roth,  1983),  and 
others  suggested  that  specific  tools  should  be 
built  to  acquire  knowledge  for  each  problem 
solving  method  (McDermott,  1988). 

Many  tools  and  techniques  have  been  developed
for  the  systematic  acquisition  of  domain
knowledge,  including  knowledge  elicitation
(KE)  methods  to  acquire  knowledge  from  a
human  expert,  machine  learning  (ML)  algorithms
that  infer  knowledge  from  data,  and
knowledge  base  refinement  (KBR)  tools  that
refine  knowledge  already  in  a  usable  form.  As
these  tools  become  more  sophisticated  and  are
enhanced  to  deal  with  real-world  applications,
their  differences  tend  to  become  less  apparent.
KE  tools,  which  were  originally  simple
implementations  of  manual  methods,  now  perform
more  tasks  automatically,  while  ML  and
KBR  tools  now  recognise  that  many  interesting
tasks  cannot  be  completely  automated  and
so  interact  more  with  their  users.  This,  in
addition  to  the  large  number  of  available
Knowledge  Acquisition  (KA)  techniques,
makes  it  very  difficult  for  many  users  to
choose  an  appropriate  tool  for  their  particular
application,  especially  when  more  than  one  is
needed  to  solve  their  problem.

Our  aim  is  to  relate  the  several  types  of  problem
solvers,  and  hence  the  kinds  of  knowledge
that  they  require,  with  these  tools.  We
would  then  be  able  to  give  advice  on  what
tools  should  be  used  to  acquire,  transform  or
refine  the  knowledge  so  it  can  be  used  in  a
particular  problem  solver.

This  paper  presents  MUSKRAT,  a  MUltiStrategy
Knowledge  Refinement  and  Acquisition  Toolbox.
MUSKRAT  is  an  open  architecture  which
supports  the  integration  of  problem  solvers
and  KA  tools,  and  assists  the  user  with  the
selection  of  the  most  suitable  KA  tool.  Section  2
introduces  some  motivations  for  MUSKRAT  in
relation  to  other  work;  section  3  describes
the  MUSKRAT  architecture;  section  4  discusses
the  various  problem  solvers  and  KA  tools
included  in  the  prototype  that  we  are  currently
developing.

2  Background

This  work  originated  in  the  Machine  Learning
Toolbox  (MLT)  project.  The  aim  of  this  project
was  to  bring  machine  learning  into  use  on
real  industrial  problems,  by  (among  other
tasks)  building  a  collection  of  ML  tools.  This
toolbox  also  includes  a  number  of  support
tools,  including  an  advice-giving  system,  the
Consultant,  which  we  developed  at  the
University  of  Aberdeen  (Craw,  1992).  The
Consultant  questions  its  user  about  the  task  he
wants  to  solve,  the  data  and  background
knowledge  he  can  provide,  etc.,  and  recommends
one  or  more  suitable  learning  tools.
Although  it  was  found  to  perform  satisfactorily,
the  Consultant  suffers  from  a  major  limitation:
it  has  no  understanding  of  the  problem
that  the  user  wants  to  solve  in  his  application
domain.  The  user  must  first  decide  what
knowledge  is  required  to  solve  his  problem,
i.e.  define  a  learning  task,  and  only  then  can
the  Consultant  help  him  with  the  choice  of  a
suitable  tool.  In  other  words,  the  Consultant  is
told  what  the  user  wants  to  know,  not  what  he
wants  to  do.


Having  a  model  of  the  target  problem  solver 
would  be  useful,  not  only  to  help  the  user 
specify  his  learning  task,  but  also  to  guide  the 
KA  process  itself.  This  is  generally  acknow¬ 
ledged  in  the  KE  community  (McDermott, 
1988). 

“Currently  the  main  theories  of  knowledge  acquisition
are  all  model  based  to  a  certain  extent.
The  model  based  approach  to  knowledge  acquisition
covers  the  idea  that  abstract  models  of  the
tasks  that  expert  systems  have  to  perform  can
highly  facilitate  knowledge  acquisition.”  (van
Heijst,  1992)
However,  this  is  not  always  accepted  by  ML
researchers,  and  ML  systems  that  use  an  ex¬ 
plicit  model  of  problem  solving  are  rare 
(Ganascia,  1993).  The  reason  is  that  an  ab¬ 
stract,  knowledge-level  model  of  a  problem 
solver  is  usually  not  sufficient  to  guide  ML  effectively.
The  knowledge  acquired  by  "manual"
KE  can  easily  be  adapted  (if  necessary)
so  as  to  fit  the  requirements  of  a  particular 
problem  solver  (e.g.  in  terms  of  knowledge 
representation),  but  that  obtained  from  an  au¬ 
tomatic  ML  tool  can  only  be  used  directly  if 
the  detailed  needs  of  the  problem  solver  are 
known  in  advance.  Instead,  the  requirements 
are  usually  assumed  and  encoded  implicitly  in 
the  ML  tool.  This  is  even  clearer  with  knowl¬ 
edge  base  refinement  tools,  which  must  gen¬ 
erally  run  a  problem  solver  on  the  knowledge 
they  refine  in  order  to  evaluate  their  modifi¬ 
cations. 

For  these  reasons,  we  decided  that  MUSKRAT, 
a  knowledge  acquisition  toolbox  which  includes
KE,  ML  and  KBR  tools,  should  also
include  problem  solving  tools,  which  will 
serve  as  the  targets  of  the  KA  process.  This  is 
in  contrast,  for  instance,  with  KEW,  the 
Knowledge  Engineering  Workbench  produced 
by  the  ACKnowledge  project  (Reichgelt, 
1992).  Since  KEW  focuses  on  KE  techniques, 
it  does  not  include  a  problem  solver,  but  in¬ 
stead  uses  Generalised  Directive  Models  to 
guide  tool  selection  (van  Heijst,  1992). 

The  integration  of  learning  and  problem 
solving  is  also  a  major  issue  in  the  field  of  in¬ 
tegrated  systems  (SIGART,  1991;  VanLehn, 
1991).  Some  such  systems  integrate  one  or 
several  KA  tools  with  a  problem  solver  (as  in 
PRODIGY  (Carbonell,  1991)).  Others  integrate
KA  and  problem  solving  in  a  single
component,  using  a  uniform  technique  (THEO
(Mitchell,  1992),  SOAR  (Laird,  1991)).  In
both  cases,  the  knowledge  base  is  tied  to  a 
particular  problem  solver.  In  contrast, 
MUSKRAT  integrates  existing,  stand-alone  KA 
tools  with  existing,  stand-alone  problem 
solvers,  so  that  the  knowledge  can  be  tested 
independently  and  shared  among  several 
problem  solvers.  Knowledge  sharing  and 
reuse  is  further  supported  by  the  fact  that  all 
the  knowledge  acquired  by  the  system  is  ex¬ 
pressed  in  a  single  representation  language, 
called  CKRL.

Finally,  the  selection  of  an  appropriate  KA 
tool  is  also  an  issue  in  many  multistrategy 
learning  systems  (Michalski,  1991).  Some 
multistrategy  systems  include  several  ML 
techniques  (for  instance  both  symbolic  and 
sub-symbolic  algorithms)  which  are  applied 
successively  to  generate  a  single  knowledge 
base.  In  MUSKRAT,  the  knowledge  to  be  ac¬ 
quired  is  structured  into  several  knowledge 
bases,  each  of  which  is  obtained  with  an  ap¬ 
propriate  technique.  Other  multistrategy  sys¬ 
tems  include  several  similar  techniques  and 
use  highly  discriminating  selection  criteria  to 
select  the  most  suitable  one,  but  we  are  not 
aware  of  any  system  that  covers  as  broad  a 
range  of  techniques  as  MUSKRAT,  including 
KE,  ML  and  KBR  tools. 

3  The  MUSKRAT  Architecture 

The  acquisition  of  control  knowledge  (i.e.  a 
problem  solving  method)  and  domain  knowl¬ 
edge  can  be  performed  in  any  relative  order. 
In  the  MUSKRAT  framework,  we  assume  that 
the  selection  of  one  or  several  KA  techniques 
proceeds  along  the  following  lines: 

1.  Identify  an  application  task,  i.e.  a  problem 
to  be  solved  in  a  particular  domain. 
2.  Select  a  suitable  problem  solver  to  solve
this  task.  If  no  single  problem  solver  can 
be  identified,  it  may  be  necessary  to  split 
the  application  task  into  sub-problems  that 
can  each  be  solved  by  a  problem  solver. 

3.  For  each  selected  problem  solver,  deter¬ 
mine  what  knowledge  bases  are  required. 
This  amounts  to  a  knowledge-level  analysis
of  each  problem  solver,  which  needs  to
be  done  only  once  since  it  does  not  depend
on  a  particular  application  task. 

4.  For  each  required  knowledge  base,  com¬ 
pare  the  problem  solver's  requirements 
with  whatever  knowledge  sources  are 
available  (human  expert,  examples,  exist¬ 
ing  knowledge,  etc.).  This  defines  one  or 
more  KA  tasks. 

5.  Select  a  KA  tool  capable  of  solving  each 
KA  task,  i.e.  bridging  the  gap  between  required
and  available  knowledge.  This  presupposes
a  preliminary  knowledge-level  analysis
of  available  tools.

6.  Apply  the  selected  KA  tool. 

These  steps  can  be  repeated  in  a  cycle,  espe¬ 
cially  if  information  acquired  in  step  6  is 
needed  to  refine  the  decisions  made  in  step  2. 

The  MUSKRAT  system  is  designed  to  support 
steps  3  to  6.  It  assumes  that  a  problem  solver 
has  been  selected  for  a  particular  task  or  sub¬ 
task,  and  directs  the  acquisition  of  knowledge 
for  this  particular  problem  solver.  The  system 
consists  of  any  number  of  problem  solvers, 
any  number  of  KA  tools,  and  a  guidance 
module,  the  KA  selector. 

The  architecture  is  centred  around  a  set  of 
Knowledge  Bases  (KBs),  which  is  the  interface
between  KA  tools  and  problem  solvers.
We  define  a  KB  as  any  body  of  knowledge
required  by  a  problem  solver.  This  includes 
not  only  “conventional”  knowledge  bases  (e.g.
rule  sets),  but  also  representation  languages, 
control  heuristics,  etc. 

In  MUSKRAT,  all  KBs  are  expressed  in  the 
same  representation  language,  CKRL.  CKRL 
(Common  Knowledge  Representation 
Language)  is  an  information  interchange  lan¬ 
guage  developed  as  part  of  the  MLT  project 
(Morik,  1991).  It  is  not  directly  executable,
but  consists  of  declarations  that  can  be
translated  into  a  tool's  internal  representation.  To
ensure  that  this  translation  is  possible  into  a
broad  range  of  representation  languages,  and 
unambiguous,  CKRL  entities  are  defined  at 
the  epistemic  level  (Brachman,  1979):  con¬ 
cepts,  instances,  relations,  properties,  sorts, 
rules,  etc.  Although  CKRL  was  originally  de¬ 
signed  as  a  communication  medium  for  ML 
tools,  it  is  general  enough  to  be  useful  in 
many  situations  where  knowledge  is  to  be 
transmitted  or  processed  in  a  number  of  ways. 

Our  choice  of  a  uniform  knowledge  represen¬ 
tation  was  motivated  by  considerations  of 
knowledge  sharing  and  reuse:  a  KB  can  be 
used  by  several  problem  solvers,  even  if  this 
was  not  anticipated  when  the  KB  was  created. 
It  also  allows  the  integration  of  new  problem 
solvers  and  KA  tools  into  MUSKRAT  at  the 
cost  of  implementing  a  single  interface  to  or 
from  CKRL.  An  additional  advantage  of 
choosing  CKRL  is  that  some  of  the  KA  tools 
in  our  prototype  are  also  part  of  the  MLT,  and
therefore  already  express  their  output  in  this 
language  (see  section  4.2). 
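
The interface-cost argument above can be illustrated with a small sketch. It assumes, purely for illustration, that each tool exposes one pair of translation functions to and from a shared interchange form. The names `Translator`, `REGISTRY` and `convert`, and the two toy rule syntaxes, are hypothetical; they are not CKRL and not part of any MLT interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Toy stand-in for a common interchange form: (premises, conclusion).
CommonRule = Tuple[Tuple[str, ...], str]


@dataclass
class Translator:
    """Hypothetical per-tool interface: one pair of functions per tool."""
    to_common: Callable[[str], CommonRule]
    from_common: Callable[[CommonRule], str]


def parse_arrow(text: str) -> CommonRule:
    # Parses the illustrative syntax "a & b => c".
    lhs, rhs = text.split("=>")
    return tuple(p.strip() for p in lhs.split("&")), rhs.strip()


def emit_arrow(rule: CommonRule) -> str:
    return " & ".join(rule[0]) + " => " + rule[1]


def parse_prolog(text: str) -> CommonRule:
    # Parses the illustrative syntax "c :- a, b.".
    head, body = text.rstrip(".").split(":-")
    return tuple(p.strip() for p in body.split(",")), head.strip()


def emit_prolog(rule: CommonRule) -> str:
    return rule[1] + " :- " + ", ".join(rule[0]) + "."


# Adding a tool means adding one entry here, not one converter per pair of tools.
REGISTRY: Dict[str, Translator] = {
    "arrow_tool": Translator(parse_arrow, emit_arrow),
    "prolog_tool": Translator(parse_prolog, emit_prolog),
}


def convert(text: str, source: str, target: str) -> str:
    """Route every exchange through the common form, never tool to tool."""
    common = REGISTRY[source].to_common(text)
    return REGISTRY[target].from_common(common)


print(convert("has_meat => not_vegetarian", "arrow_tool", "prolog_tool"))
```

With n tools, this hub arrangement needs n translator pairs, whereas direct tool-to-tool conversion would need one converter for each ordered pair of tools.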


In  figure  1,  circles  represent  bodies  of  knowledge
and  boxes  represent  MUSKRAT's  components.
There  are  three  types  of  boxes:  thin
boxes  represent  problem  solvers,  black  boxes 
represent  KA  tools,  and  thick  boxes  represent 
advice-giving  systems.  The  types  of  knowl¬ 
edge  provided  to  MUSKRAT  are  an  initial 
problem  description  (top)  and  various  sources 
of  problem  solving  knowledge  (bottom  left), 
and  the  system  is  used  to  produce  one  or  more
KBs  (middle). 


This  architecture  can  be  used  at  two  different 
stages  of  the  problem  solving  cycle:  (a)  the 
selection  of  suitable  tools,  and  (b)  the  use  of
the  tools  to  acquire  knowledge  and  solve  the 
problem.  We  will  now  describe  these  two 
stages  in  detail,  and  explain  the  role  of  each 
component. 

The  tool  selection  process  starts  with  an  initial 
description  of  the  problem.  An  advice-giving 
system,  the  PS  selector,  helps  the  user  with 
the  selection  of  a  suitable  problem  solver.  The 
PS  selector  is  not  currently  part  of  MUSKRAT,
but  a  module  similar  to  KEW's  advice  and 
guidance  module  (van  Heijst,  1992)  should  be 
usable  here. 

Once  a  problem  solver  has  been  selected,
MUSKRAT  knows  which  KB(s)  are  required.
For  instance,  in  figure  1,  if  we  want  to  use  the
first  problem  solver  (Problem  Solver  1)  we
will  have  to  acquire  three  KBs:  KB1,  KB2  and
KB3.  At  this  stage,  each  KB  is  only  specified 
in  terms  of  its  functionalities  and  representa¬ 
tion,  as  required  by  the  problem  solver.  These 
requirements  are  expressed  in  a  formalism 
which  provides  descriptors  for  both  knowl¬ 
edge-level  and  symbol-level  features.  We  are 
currently  defining  such  a  formalism  to  describe
the  effects  and  requirements  of  the  particular
tools  included  in  the  MUSKRAT  prototype
(section  4).  Since  these  tools  cover  a
fairly  broad  range  of  techniques,  we  expect
that  our  formalism  will  be  easily  extended
or  refined  to  be  applicable  to  most  existing
KA  techniques.

The  next  step  is  to  identify  the  available 
knowledge  sources  (bottom  left  in  the  figure). 
We  consider  three  broad  categories  of  knowl¬ 
edge  sources:  available  knowledge  refers  to 
knowledge  that  is  already  in  the  form  required
for  a  KB,  e.g.  a  set  of  rules.  It  may  be  directly
usable  or  require  further  transformation  or  refinement.
Note  that  knowledge  is  seldom
available  initially,  but  when  MUSKRAT  is  used 
iteratively  as  part  of  a  problem  solving  cycle, 
“available  knowledge”  refers  to  that  acquired 
during  a  previous  iteration.  Available  data 
refers  to  data  that  is  relevant  to  the  problem 
and  from  which  useful  information  could  be 
extracted,  although  it  does  not  meet  the  re¬ 
quirements  of  the  KB.  Typically,  this  may 
consist  of  past  cases,  i.e.  previously  solved 
problems  similar  to  the  one  at  hand,  from
which  insight  into  the  new  problem  can  be
gained.  Alternatively,  if  the  problem  is  to  di¬ 
agnose  faults  in  a  complex  system,  “available 
data”  may  refer  to  a  model  of  the  system, 
which  is  useful  (or  perhaps  necessary)  to  per¬ 
form  diagnosis.  Note  that  the  distinction  be¬ 
tween  knowledge  and  data  is  not  intrinsic  but 
depends  on  the  KB  requirements.  For  in¬ 
stance,  a  set  of  past  cases  is  considered  as 
knowledge  if  it  is  to  be  used  by  a  case-based 
reasoner  that  can  use  it  directly,  but  it  is  only 
data  for  a  rule-based  system  which  is  unable 
to  reason  from  cases.  Finally,  an  expert  is  a 
person  who  can  provide  various  forms  of 
knowledge,  possibly  with  the  help  of  a  KE
tool  and/or  a  knowledge  engineer. 

The  KA  selector  is  the  central  component  of 
MUSKRAT.  It  compares  the  requirements  of 
the  selected  problem  solver  with  the  charac¬ 
teristics  of  available  knowledge  sources  and 
recommends  the  use  of  one  or  more  KA  tools. 
For  that  purpose,  it  has  a  knowledge-level  de¬ 
scription  of  each  available  KA  tool  and  per¬ 
forms  a  means-ends  analysis  to  decide  which 
one  is  most  capable  of  reducing  the  differ¬ 
ences. 
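
One step of such a means-ends analysis might look like the following minimal sketch. It assumes that KB requirements, knowledge sources and tool descriptions can each be reduced to sets of symbolic features; the class `ToolDescription`, the function `select_tool` and all feature names are hypothetical and do not reproduce the description formalism used in MUSKRAT.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ToolDescription:
    """Hypothetical knowledge-level description of a KA tool."""
    name: str
    needs: FrozenSet[str]     # knowledge sources the tool consumes
    provides: FrozenSet[str]  # KB features the tool can establish


def select_tool(required: FrozenSet[str],
                satisfied: FrozenSet[str],
                sources: FrozenSet[str],
                tools: List[ToolDescription]) -> Optional[ToolDescription]:
    """Pick the applicable tool that removes the most outstanding differences."""
    differences = required - satisfied
    best, best_gain = None, 0
    for tool in tools:
        if not tool.needs <= sources:
            continue  # the tool cannot run with the sources at hand
        gain = len(differences & tool.provides)
        if gain > best_gain:
            best, best_gain = tool, gain
    return best


# Illustrative tool descriptions (feature names are assumptions).
TOOLS = [
    ToolDescription("elicitation", frozenset({"expert"}),
                    frozenset({"descriptors", "descriptions"})),
    ToolDescription("induction", frozenset({"data"}), frozenset({"rules"})),
    ToolDescription("refinement", frozenset({"knowledge", "data"}),
                    frozenset({"validated_rules"})),
]

choice = select_tool(frozenset({"rules"}), frozenset(),
                     frozenset({"data", "expert"}), TOOLS)
print(choice.name if choice else "no suitable tool")
```

Returning `None` when no tool reduces any difference leaves room for the fallback case in which an expert has to supply the knowledge by hand.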

Since  there  are  three  types  of  knowledge 
sources  that  the  KA  selector  can  decide  to  use 
or  not  use,  eight  (2^3)  combinations  could  be
considered.  For  each  combination,  a  suitable 
KA  technique  must  be  identified.  The  four 
most  common  combinations  are  represented  in 
figure  1  by  vertical  lines  ending  in  each  of  the 
four  KBs: 

1.  Available  knowledge  exactly  matches  re¬ 
quired  knowledge,  and  can  thus  be  used 
directly,  or  possibly  with  some  syntactic 
data  manipulation  (not  represented).  This 
is  shown  as  an  empty  black  box,  repre¬ 
senting  direct  transfer  of  knowledge. 


2.  Available  knowledge  can  be  used  by  the 
selected  problem  solver,  but  does  not  pro¬ 
duce  satisfactory  results  (this  normally  oc¬ 
curs  when  MUSKRAT  is  used  iteratively, 
and  the  knowledge  in  question  was  ob¬ 
tained  and  tested  during  a  previous  itera¬ 
tion).  A  KBR  tool  can  then  be  used  to  re¬ 
fine  this  knowledge.  A  domain  expert  is 
usually  required,  either  during  the  refinement
process  or  only  to  validate  the  resulting
KB.

3.  If  there  is  not  enough  or  no  available 
knowledge  but  other  data  is  present,  an 
ML  tool  can  be  used  to  extract  useful 
knowledge.  Depending  on  the  selected 
tool,  existing  (incomplete)  knowledge  may 
also  be  used,  and  an  expert  may  be  re¬ 
quired  to  interact  with  the  ML  tool.  In  any 
case,  an  expert  is  almost  always  required 
to  validate  the  newly  acquired  KB. 
Figure  2:  Flow  of  information  for  tool  selection.


4.  If  no  other  source  is  available,  the  KA  se¬ 
lector  still  has  the  resource  to  recommend 
a  KE  tool,  which  will  attempt  to  obtain  the
required  knowledge  directly  from  an  ex¬ 
pert,  using  a  more  or  less  systematic 
methodology. 

Another  plausible  situation  (not  illustrated)  is 
that  no  tool  is  available  to  produce  the  re¬ 
quired  KB.  In  this  case,  the  KA  selector  will 
merely  describe  the  requirements  to  an  expert 
and  let  him  provide  this  knowledge 
“manually”.  The  help  of  a  knowledge  engi¬ 
neer  will  usually  be  necessary  in  this  case;  it  is 
also  desirable  in  the  previous  cases. 
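
The eight combinations of knowledge sources can be enumerated mechanically, as in the sketch below. The text describes only the four most common cases, so the priority order used here to resolve the remaining combinations is an assumption made for illustration, and the function name `suggest_technique` is hypothetical.

```python
from itertools import product


def suggest_technique(use_knowledge: bool, use_data: bool, use_expert: bool) -> str:
    """Map a combination of usable sources to a family of KA technique.

    The priority order (data first, then knowledge, then expert) is an
    assumption of this sketch, not a rule stated in the text.
    """
    if use_data:
        return "ML tool"
    if use_knowledge and use_expert:
        return "KBR tool"
    if use_knowledge:
        return "direct transfer"
    if use_expert:
        return "KE tool"
    return "no source available"


# All 2^3 combinations of (knowledge, data, expert).
table = {combo: suggest_technique(*combo)
         for combo in product([False, True], repeat=3)}

for (knowledge, data, expert), technique in table.items():
    print(f"knowledge={knowledge!s:5} data={data!s:5} expert={expert!s:5} -> {technique}")
```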

The  tool  selection  process  is  summarised  in 
figure  2,  where  thick  arrows  represent  the 
flow  of  information  which  converges  from 
two  different  directions  into  the  KA  selector. 


Once  appropriate  problem  solvers  and  KA 
tools  have  been  selected,  they  can  be  used  to 
actually  solve  the  problem.  Since  MUSKRAT 
only  performs  the  integration  of  independent 
tools,  it  provides  no  support  with  the  use  of 
individual  tools.  At  this  stage,  its  role  is  lim¬ 
ited  to  the  communication  of  knowledge  be¬ 
tween  different  tools.  The  flow  of  informa¬ 
tion,  illustrated  in  figure  3,  shows  knowledge 
coming  out  of  knowledge  sources,  being  pro¬ 
cessed  by  KA  tools,  and  finally  used  by  a 
problem  solver.  It  is  the  user's  responsibility 
to  evaluate  the  solution  obtained  by  the  prob¬ 
lem  solver  and,  if  necessary,  to  start  a  new  KA 
cycle. 


Figure  3:  Flow  of  information  for  problem  solving.


4  The  MUSKRAT  Prototype

To  illustrate  our  approach,  we  have  chosen  a
“toy”  domain  where  several  problems  can  be
identified  which  require  different  problem
solvers,  but  where  some  of  the  KBs  required
can  be  shared  by  at  least  two  problem  solvers.
We  are  currently  developing  a  prototype  toolbox
including  three  different  problem  solvers
and  four  KA  tools,  selected  to  suit  these  particular
problems.  All  these  tools  are  described
in  more  detail  below.

It  should  be  noted  that,  since  the  main  focus
of  our  work  is  on  knowledge  acquisition
rather  than  problem  solving,  we  decided  to
keep  the  problem  solvers  fairly  simple,  even  if
this  implies  that  we  can  only  solve  simplified
versions  of  our  original  problems.

4.1  Problems  and  problem  solvers

The  domain  that  we  consider  is  the  planning 
of  a  meal.  It  includes  three  distinct  problems: 
selecting  dishes  given  a  set  of  constraints, 
analysing  and  criticising  a  selected  menu,  and 
scheduling  the  meal  preparation  given  time 
constraints  and  limited  resources. 

Enhanced  versions  of  these  problem  solvers
will  later  be  applied  to  similar,  though  much 
larger,  problems  in  the  domain  of  flexible 
manufacturing,  namely  the  customised  design 
of  mechanical  devices  under  specific  con¬ 
straints,  analysis  of  existing  designs,  and 
flexible  workshop  scheduling. 

4.1.1  Constraint  satisfaction 

The  problem  can  be  described  as  follows: 
given  a  set  of  constraints,  select  a  menu  (from 
a  pre-defined  list  of  dishes)  that  satisfies  the
largest  number  of  constraints.  Examples  of 
constraints  include:  “the  meal  should  include 
a  starter,  a  main  course  and  optionally  a 
dessert”,  “select  a  vegetarian  meal”,  “at  most 
one  dish  may  include  sea  food”,  “the  total
price  must  not  exceed  £X”,  etc.

If  all  the  constraints  cannot  be  met  simultane¬ 
ously,  the  system  must  decide  which  constraint(s)
should  be  relaxed  first.
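
The problem statement can be made concrete with a brute-force sketch that scores every candidate menu by the number of constraints it satisfies. The dishes, prices and constraint wordings below are invented for illustration, and exhaustive enumeration is a simplification chosen for clarity, not a description of the prototype's problem solver.

```python
from itertools import product
from typing import Any, Callable, Dict, Tuple

Dish = Dict[str, Any]
Menu = Tuple[Dish, ...]

# Illustrative dishes; names, prices and descriptors are made up.
STARTERS = [
    {"name": "fish soup", "seafood": True, "vegetarian": False, "cost": 3},
    {"name": "vegetable soup", "seafood": False, "vegetarian": True, "cost": 2},
]
MAINS = [
    {"name": "grilled sole", "seafood": True, "vegetarian": False, "cost": 6},
    {"name": "mushroom risotto", "seafood": False, "vegetarian": True, "cost": 5},
]

CONSTRAINTS: Dict[str, Callable[[Menu], bool]] = {
    "at most one seafood dish": lambda m: sum(d["seafood"] for d in m) <= 1,
    "vegetarian meal": lambda m: all(d["vegetarian"] for d in m),
    "total price at most 6": lambda m: sum(d["cost"] for d in m) <= 6,
}


def satisfied(menu: Menu, constraints: Dict[str, Callable[[Menu], bool]]) -> int:
    """Count how many constraints the menu meets."""
    return sum(1 for check in constraints.values() if check(menu))


def best_menu(constraints: Dict[str, Callable[[Menu], bool]]) -> Menu:
    """Score every starter/main pair and keep the highest-scoring one."""
    return max(product(STARTERS, MAINS), key=lambda m: satisfied(m, constraints))


menu = best_menu(CONSTRAINTS)
print([d["name"] for d in menu], satisfied(menu, CONSTRAINTS))
```

In this toy data no menu meets all three constraints, so the winning menu is the one that gives up only the price limit. Counting satisfied constraints treats them all as equally important; choosing which constraint to give up is the role of the meta-rules in A5.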

The  KBs  required  to  solve  this  problem  are: 

A1.  A  set  of  descriptors  (attributes)  used  to
represent  dishes.  We  only  use  boolean-
and  numeric-valued  descriptors.
Examples  include  “has-meat”,  “warm”,
“is-starter”,  “cost”,  etc.

A2.  A  set  of  dish  descriptions.  A  description 
consists  of  a  list  of  ingredients,  and  the
values  of  general  descriptors  such  as  “is-soup”
or  “warm”.

A3.  A  set  of  rules  that  infer  the  values  of  de¬ 
scriptors  from  ingredients  and/or  other 
descriptors.  These  rules  are  used  to  generate
complete  internal  representations  of
dishes  from  incomplete  user-provided
descriptions.  Examples  include  “beef  ∈
ingredients  ⇒  has-meat”,  “has-meat  ⇒
¬vegetarian”  and  “cost  >  5  ⇒  expensive”.

A4.  Predefined  constraints  that  can  be  used  in 
queries,  such  as  “vegetarian-meal  ≡
all(is-vegetarian)”  or  “cheap-meal  ≡
sum(cost)  <  £6”.  A  query  can  combine
any  number  of  predefined  constraints  and 
user-defined  constraints  written  using  a 
fixed  set  of  operators. 

A5.  Meta-rules  that  tell  the  system  which 
constraint  to  relax  when  all  the  con¬ 
straints  cannot  be  satisfied  simultane¬ 
ously.  For  instance,  a  rule  might  say  that 
cost-related  constraints  are  less  important
than  dietary  constraints,  or  that  the  number
of  courses  is  the  most  important  constraint.
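
Two of the knowledge bases listed above lend themselves to a short sketch: the description expansion rules (A3) and the relaxation meta-rules (A5). The encoding below is an assumption made for illustration: rules are stored as (premise, descriptor, value) triples and applied by naive forward chaining, and the meta-rules are reduced to a single priority list. The cost threshold and all names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Dish = Dict[str, Any]
Rule = Tuple[Callable[[Dish], bool], str, Any]  # (premise, descriptor, value)

# The three example rules given for A3, in an assumed encoding.
RULES: List[Rule] = [
    (lambda d: "beef" in d.get("ingredients", ()), "has-meat", True),
    (lambda d: d.get("has-meat") is True, "vegetarian", False),
    (lambda d: d.get("cost", 0) > 5, "expensive", True),
]


def expand(dish: Dish, rules: List[Rule] = RULES) -> Dish:
    """Forward-chain until no rule adds a new descriptor value."""
    result = dict(dish)
    changed = True
    while changed:
        changed = False
        for premise, descriptor, value in rules:
            if descriptor not in result and premise(result):
                result[descriptor] = value
                changed = True
    return result


# A5 reduced to a priority list, most important kind of constraint first.
PRIORITY = ["courses", "dietary", "cost"]


def relaxation_order(violated: List[str]) -> List[str]:
    """Return the violated constraint kinds, least important first."""
    return sorted(violated, key=PRIORITY.index, reverse=True)


print(expand({"ingredients": ["beef", "onion"], "cost": 7}))
print(relaxation_order(["dietary", "cost"]))
```

Because user-provided values are never overwritten, an explicit descriptor in the input always takes precedence over an inferred one in this sketch.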

4.1.2  Design  analysis

The  purpose  of  this  problem  solver  is  to  take  a 
menu  generated  by  the  first  problem  solver  or 
any  other  source,  and  to  issue  a  list  of  com¬ 
ments,  and  suggestions  for  possible  improve¬ 
ments.  A  typical  output  from  this  module 
could  be  “This  meal  supplies  2/3  of  the  recommended
daily  allowance  of  carbohydrates”,
or  “This  meal  is  unbalanced  because  it  con¬ 
tains  two  sea  food  dishes;  you  should  replace 
fish  soup  with  vegetable  soup”. 


The  KBs  required  by  this  problem  solver  are:

B1.  A  set  of  dish  descriptors  (identical
to  A1).

B2.  A  list  of  all  the  dishes  that  may  appear  in
menus  (identical  to  A2). 

B3.  A  set  of  description  expansion  rules 
(identical  to  A3). 

B4.  A  set  of  rules  that  derive  comments  and 
recommendations  from  descriptors,  for 
instance  “count(seafood)  >  1  ⇒
comment(too-much-seafood)”.  Each  comment
is  associated  with  a  canned  piece  of
English  text.
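
The comment rules in B4 can be sketched as premises over a whole menu paired with comment identifiers, each identifier being looked up in a table of canned text. The rule encoding, the identifiers' wording and the function name `analyse` are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Menu = List[Dict[str, Any]]

# Canned English text per comment identifier (wording is illustrative).
CANNED_TEXT: Dict[str, str] = {
    "too-much-seafood": "This meal is unbalanced because it contains "
                        "more than one sea food dish.",
}

# Each rule pairs a premise over the whole menu with a comment identifier.
COMMENT_RULES: List[Tuple[Callable[[Menu], bool], str]] = [
    (lambda menu: sum(1 for d in menu if d.get("seafood")) > 1, "too-much-seafood"),
]


def analyse(menu: Menu) -> List[str]:
    """Return the canned text of every comment whose premise holds."""
    return [CANNED_TEXT[comment]
            for premise, comment in COMMENT_RULES
            if premise(menu)]


print(analyse([{"name": "fish soup", "seafood": True},
               {"name": "grilled sole", "seafood": True}]))
```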

4.1.3  Task  scheduling

Once  a  menu  has  been  selected,  this  problem 
solver  can  be  used  to  generate  a  plan  to  pre¬ 
pare  it.  Each  dish  has  a  recipe,  which  is  a 
fixed,  partially  ordered  list  of  actions.  The 
problem  is  to  set  the  starting  time  of  the  tasks 
involved  in  the  recipes  of  all  the  dishes  in  the
menu,  so  as  to  meet  a  set  of  time  and  resource 
constraints.  A  time  constraint  may  be  that  two 
dishes  must  be  ready  and  warm  at  the  same 
time;  a  resource  constraint  may  be  that  only 
one  oven  is  available,  therefore  at  most  one 
dish  can  be  baked  at  a  time. 
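
The resource side of this problem can be illustrated with a greedy earliest-start scheduler. This is a sketch under simplifying assumptions: each recipe is treated as a strict chain of tasks rather than a partial order, only resource constraints are handled (not the "ready at the same time" constraint), and the recipes, durations and function name `schedule` are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

# A task is (name, duration in minutes, resource needed or None).
Task = Tuple[str, int, Optional[str]]

RECIPES: Dict[str, List[Task]] = {
    "pie": [("prepare pie", 20, None), ("bake pie", 40, "oven")],
    "roast": [("season roast", 10, None), ("bake roast", 60, "oven")],
}


def schedule(recipes: Dict[str, List[Task]],
             capacity: Dict[str, int]) -> Dict[str, int]:
    """Assign a start time to every task, dish by dish.

    Each unit of a resource is modelled by the time at which it becomes free;
    a task takes the unit that frees up first.
    """
    free_at = {resource: [0] * units for resource, units in capacity.items()}
    starts: Dict[str, int] = {}
    for tasks in recipes.values():
        t = 0  # earliest time the next task of this dish may start
        for name, duration, resource in tasks:
            if resource is not None:
                slots = free_at[resource]
                i = min(range(len(slots)), key=slots.__getitem__)
                t = max(t, slots[i])
                slots[i] = t + duration
            starts[name] = t
            t += duration
    return starts


print(schedule(RECIPES, {"oven": 1}))
print(schedule(RECIPES, {"oven": 2}))
```

With a single oven the second dish has to wait until the first has finished baking; with two ovens both dishes bake as soon as their preparation is done.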

The  KBs  required  by  this  problem  solver  are: 

C1.  A  list  of  dishes  with  associated  recipes.
This  is  a  superset  of  A2,  since  a  recipe 
includes  a  list  of  ingredients  and  an  or¬ 
dered  list  of  actions. 

C2.  A  list  of  available  resources,  and  the 
amount  of  each  resource  that  is  available. 
These  resources  are  supposed  to  be  per¬ 
manent  and  usable  for  any  number  of 
tasks  (e.g.  an  oven);  resources  that  are
consumed  by  a  particular  task  are  listed
as  “ingredients”.  Ingredients  are  always
assumed  to  be  available  in  sufficient
amounts,  since  they  have  to  be  purchased
for  each  dish.

4.2  Knowledge  acquisition  tools

Unlike  the  above  problem  solvers,  which  are 
being  implemented  as  part  of  this  work,  the 
KA  tools  described  in  this  section  were  developed
independently  by  other  researchers.  Our
goal  is  to  show  that  they  can  be  made  to  work 
together  to  produce  complementary  KBs,  with 
minimal  modifications.  One  significant  en¬ 
hancement  that  has  to  be  made,  however,  is 
that  they  must  all  express  their  output  in  the 
same  knowledge  representation  language, 
CKRL.  This  is  already  the  case  for  those 
which  are  part  of  the  MLT,  namely  APT  and 
KBG. 

4.2.1  Repertory  grid 

The  repertory  grid  is  a  KE  technique  derived 
from  cognitive  psychology  (Kelly,  1955).  It 
provides  a  systematic  way  of  interactively 
eliciting  elements  (examples)  and  constructs 
(descriptors)  from  an  expert.  Although  it  is
fundamentally  a  methodology,  it  can  be  supported
by  software  tools  such  as  Tacktix
(Reichgelt,  1992)  that  not  only  acquire  this
knowledge  but  also  compute  similarities  and 
correlations  between  elements  and  between 
constructs. 

In  our  application,  this  tool  is  used  to  acquire 
simultaneously  dish  descriptors  (A1  or  B1)
and  descriptions  (A2  or  B2),  since  these  KBs 
must  be  acquired  directly  from  an  expert.  In 
addition,  correlations  between  descriptors 
suggest  possible  rules  for  A3  (or  B3),  although
those  can  be  more  adequately  acquired
by  KBG  (see  below).

4.2.2  KBG 

KBG  (Bisson,  1992)  is  an  ML  clustering  and 
generalisation  tool.  It  can  either  take  unclassified
examples  and  cluster  them  according  to  a
particular,  flexible  metric,  or  take  classified 
examples  and  induce  discrimination  rules.  In 
both  cases,  it  can  also  use  background  knowl¬ 
edge  in  the  form  of  rules  to  complete  example 
descriptions.  An  interesting  feature  of  KBG  is 
that  its  learning  examples  and  output  rules  are 
expressed  in  (restricted)  first-order  logic, 
which  means  in  particular  that  all  the  exam¬ 
ples  need  not  be  represented  by  the  same  de¬ 
scriptors. 

In  our  example,  KBG  is  used  to  infer  rules  for 
A3  (and  B3):  given  a  small  number  of  com¬ 
plete  dish  descriptions,  it  finds  correlations 
between  descriptors  that  can  later  be  used  to 
complete  new  (incomplete)  descriptions. 

It  can  also  be  used  to  infer  control  rules  for 
A5.  In  this  case,  a  learning  example  is  a  set  of 
constraints  and  an  indication  of  which  one 
should  be  dropped.  Since  constraints  are 
complex  objects  that  cannot  be  represented  as
attribute/value  pairs,  KBG's  first  order  repre¬ 
sentation  is  very  suitable  for  this  task. 

Finally,  its  clustering  and  concept  formation 
ability  can  be  used  to  select  useful  predefined 
constraints  (A4).  Since  such  constraints  are
provided  only  for  user  convenience,  it  is  useful
to  detect  patterns  that  occur  frequently  in
user-defined  constraints,  and  add  them  to  the 
set  of  pre-defined  constraints.  KBG  can  help 
with  this  pattern  detection. 
4.2.3  APT

APT  (Nédellec,  1992)  uses  a  combination  of
KE  and  ML  techniques  to  acquire  problem 
solving  rules.  It  starts  with  a  domain  theory,  in 
the  form  of  a  semantic  network,  and  possibly 
an  initial  set  of  rules.  When  it  cannot  solve  a 
problem  with  its  rules,  it  asks  the  user  for  a 
particular  solution,  then  uses  the  domain  the¬ 
ory  to  generalise  it.  The  user  is  constantly  requested
to  validate  the  rules  generated  by  the
system,  and  he  can  extend  the  domain  theory 
if  necessary  to  enable  APT  to  infer  correct 
generalised  rules. 

When  a  limited  domain  model  is  available, 
APT  can  be  used  as  a  KE  tool  to  acquire  rules 
and  enhance  the  model.  When  an  important 
set  of  rules  is  available,  it  can  be  seen  as  an 
interactive  KBR  tool.  These  capabilities,  to¬ 
gether  with  its  rich  knowledge  representation 
(semantic  network)  make  it  suitable  to  acquire 
analysis  rules  (B4),  as  well  as  recipes  (Cl) 
that  can  be  regarded  as  problem  decomposi¬ 
tion  rules. 

4.2.4  KRUST 

KRUST  (Craw,  1990)  is  an  automatic  KBR 
tool.  Given  a  set  of  Prolog-like  rules,  and  an 
example  incorrectly  classified  by  these  rules, 
it  considers  many  possible  remedies 
(generalising  or  specialising  rule  premises,  re¬ 
ordering  rules,  adding  new  rules...),  tests  them 
against  known  cases  and  implements  the  most 
successful  ones.  It  occasionally  consults  an 
expert  to  validate  its  recommendations. 
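
The generate-and-test idea behind this kind of refinement can be shown on a deliberately tiny case. The sketch below is not KRUST's algorithm: it assumes a single rule of the form "count > threshold ⇒ comment", generates variants by generalising or specialising the threshold, and keeps the variant that agrees with the most known cases. The function names and the cases are invented for illustration.

```python
from typing import List, Sequence, Tuple

# Known cases: (number of seafood dishes, did the expert make the comment?).
Case = Tuple[int, bool]
CASES: List[Case] = [(0, False), (1, False), (2, True), (3, True)]


def accuracy(threshold: int, cases: Sequence[Case]) -> float:
    """Fraction of cases on which 'count > threshold' agrees with the expert."""
    return sum((count > threshold) == label for count, label in cases) / len(cases)


def refine(threshold: int, cases: Sequence[Case],
           deltas: Sequence[int] = (-1, 1)) -> int:
    """Generate neighbouring premises, test each, keep the best.

    The original threshold is listed first, so it is kept on ties.
    """
    candidates = [threshold] + [threshold + d for d in deltas]
    return max(candidates, key=lambda t: accuracy(t, cases))


faulty = 2  # "count(seafood) > 2" misses the two-dish case
print(accuracy(faulty, CASES), refine(faulty, CASES))
```

A real refinement tool considers many more kinds of change (reordering rules, adding rules) and may ask an expert to confirm the result; here the known cases alone decide.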

In  our  prototype,  KRUST  uses  examples  of 
menus  commented  on  by  an  expert  to  refine 
the  “comment”  rules  (B4). 


4.3  Summary

The  following  table  summarises  the  relation¬ 
ships  between  MUSKRAT's  components.  For
each  knowledge  base,  it  shows  which  KA 
tools  can  generate  it  and  which  problem 
solvers  can  use  it.


KB          created  by      used  by

A1,  B1     Grid             const.  sat.,  analysis
A2,  B2     Grid             const.  sat.,  analysis
A3,  B3     KBG,  Grid       const.  sat.,  analysis
A4          KBG              const.  sat.
A5          KBG              const.  sat.
B4          APT,  KRUST      analysis
C1          APT              scheduling
C2          (no  tool)       scheduling

5  Conclusion 

We  have  presented  an  architecture  that  allows 
the  integration  of  independent  problem 
solving  and  knowledge  acquisition  tools  into  a 
uniform  framework.  Integration  is  achieved 
by  means  of  two  common  languages:  a 
knowledge-level  description  of  the  tools  (used 
by  the  KA  selector  to  provide  advice  and 
guidance),  and  a  uniform  representation  of 
data  (which  encourages  knowledge  sharing 
and  reuse).  A  prototype  is  currently  being  im¬ 
plemented  to  validate  this  architecture.  In  par¬ 
ticular,  we  will  need  to  evaluate  the 
extensibility  of  the  system.  When  a  new  tool 
is  added,  a  knowledge-level  description  of  its 
functionality  must  be  provided.  It  is  at  present 
difficult  to  estimate  the  effort  required  to 
derive  such  a  model.  A  challenging  goal  for 
future  research  would  be  to  generate,  or,  more 
realistically,  to  refine,  knowledge-level 
models  of  problem  solvers  by  letting  the 
system  perform  its  own  experiments.  This 
opens  interesting  perspectives  in  the  field  of
autonomous  multistrategy  learning  systems. 
Abstract

The  paper  is  concerned  with  supervised  learning 
of  numeric  target  concepts.  The  task  is  to  learn
to  predict  or  determine  the  exact  values  of  some 
numeric  target  variables.  Training  examples 
may  be  described  by  both  symbolic  and  numeric 
predicates.  General  domain  knowledge  may  be 
available  in  qualitative  form.  The  paper  presents
a  general  learning  model  for  such  domains.  The 
model  integrates  a  symbolic  learning  compo¬ 
nent,  which  is  based  on  a  multi-instance  plausi¬ 
ble  explanation  algorithm,  and  an  instance- 
based  learning  component,  which  stores 
instances  with  precise  values  and  predicts  new 
values  by  interpolation.  The  symbolic  compo¬ 
nent  can  use  available  qualitative  background 
knowledge;  it  learns  sub-concepts  that  partition
the  space  for  the  underlying  instance-based 
method.  A  realization  of  the  model  in  a  system 
named  IBL-Smart  is  then  described.  The  system
has  been  applied  to  a  complex  task  from  the
domain  of  tonal  music,  and  some  experimental 
results  are  reported  that  demonstrate  the  effec¬ 
tiveness  of  the  method. 

Key  words:  Knowledge-based  learning, 
instance-based  learning,  integrated  learning, 
qualitative  models. 


1  Introduction 

It  is  being  recognized  by  more  and  more  re¬ 
searchers  that  qualitative  background  knowl¬ 
edge  is  naturally  available  in  many  domains, 
and  that  learning  algorithms  are  needed  that  can 
effectively  use  such  knowledge,  even  if  it  is  in¬ 
complete  and  inconsistent,  and  generally  ab¬ 
stract  and  imprecise.  Some  approaches  to  this 
problem  have  been  proposed  in  the  recent  past, 
most  of  them  centering  around  the  notion  of  in¬ 
complete  or  plausible  explanations  (see,  e.g., 
Tecuci,  1991;  Tecuci  &  Michalski,  1991;  Wid¬ 
mer,  1991).  All  these  methods  and  systems  as¬ 
sume  that  the  target  concepts  are  discrete  classes 
of  objects,  to  be  described  by  classification  rules 
which  assign  a  new  object  to  its  appropriate 
class. 

However,  there  are  also  many  learning  prob¬ 
lems  with  numeric  target  concepts,  i.e.,  where 
the  task  is  to  predict  more  or  less  precisely  the 
values  of  some  numeric  variables.  The  training 
instances  may  be  described  by  both  symbolic 
and  numeric  predicates.  In  such  domains,  too, 
general  domain  knowledge  may  be  available 
that  relates  certain  parameters,  but  maybe  only
in  a  qualitative,  imprecise  way.  As  an  example, 
consider  typical  prediction  tasks  such  as  stock 
market  prediction  or  the  prediction  of  energy 
consumption  or  demand  in  some  power  plant. 
One  can  easily  conceive  of  partial  qualitative 
models  of  these  domains  that  would  capture 
some  of  the  relevant  domain  knowledge.  Intelli¬ 
gent  learners  should  be  able  to  utilize  such  ab¬ 
stract  knowledge. 

With  few  exceptions  (e.g.,  regression  trees  - 
Breiman  et  al.,  1984),  ‘classical’  symbolic 
learning  methods  cannot  be  used  for  such  nu¬ 
meric  problems  (unless  the  domain  of  the  nu¬ 
meric  target  concept  can  be  abstracted  into  dis¬ 
crete,  qualitative  subranges  without  loss  of  rele¬ 
vant  information).  In  particular,  plausible  ex¬ 
planation  methods  capable  of  utilizing  qualita¬ 
tive  domain  knowledge,  like  those  mentioned 
above,  are  not  applicable  to  such  domains;  they 
assume  discrete  target  concepts,  and  it  is  not 
clear  how  the  qualitative  background  knowl¬ 
edge  should  be  related  to  the  precise  numeric  in¬ 
formation  in  the  data. 

The  topic  of  this  paper  is  a  new  learning  model 
(and  an  implemented  system)  that  can  learn  nu¬ 
meric  target  concepts  while  taking  maximum 
advantage  of  available  qualitative  domain 
knowledge,  given  that  the  problem  and  the  tar¬ 
get  concepts  satisfy  some  basic  assumptions. 
The  approach  consists  essentially  in  using  a 
symbolic  learner  to  partition  the  space  for  an 
instance-based,  numeric  method  that  is  used  to 
predict  precise  values  of  the  target  variables. 
The  symbolic  learner  produces  plausible  ex¬ 
planations  for  discrete  subconcepts;  the  ex¬ 
planations  (and  the  extracted  rules)  are  based 
both  on  qualitative  background  knowledge  and 
on  empirical  information  from  the  training  data. 
The  underlying  instance-based  method  stores 


the  examples  in  several  independent  instance 
spaces  and  uses  the  learned  symbolic  rules  to 
decide  which  instance  space  is  relevant  to  a  giv¬ 
en  example.  Values  for  new  examples  are  then 
predicted  by  numeric  interpolation  in  those
instance  spaces  that  are  classified  as  relevant  by 
the  associated  symbolic  rules. 

The  motivation  for  this  research  was  a  practical 
and  complex  problem  in  the  domain  of  tonal 
music,  namely,  learning  to  apply  expressive  in¬ 
terpretation  to  a  given  piece  of  music,  i.e.,  to  dy¬ 
namically  vary  tempo  and  dynamics  in  order  to 
produce  a  musically  satisfying  performance. 
The  target  concepts  in  this  problem  are  neces¬ 
sarily  numeric  (exactly  how  much  variation 
should  be  applied  to  a  given  note),  and  there  is 
some  natural  domain  knowledge  that  is  relevant 
to  the  task.  The  domain  knowledge  comes  from 
music  theory  and  is  inherently  qualitative  and 
incomplete,  but  describable  in  explicit  form. 

The  model  has  been  implemented  in  a  learning 
system  named  IBL-Smart  (for  reasons  that  will 
become  obvious  soon).  We  will  first  present  the 
general  learning  model,  then  describe  its  real¬ 
ization  in  the  system  IBL-Smart,  and  illustrate 
its  applicability  with  a  description  of  our  partic¬ 
ular  musical  application  and  some  experimental 
results.  Our  approach  was  strongly  inspired  by 
ideas  presented  in  (DeJong,  1989),  and  the  last 
section  will  relate  our  system  to  that  and  other 
work. 

2  The  General  Model 

2.1  Statement  of  the  learning  problem 

This  paper  deals  with  supervised  learning  of  nu¬ 
meric  target  concepts.  More  precisely,  the  class 
of  learning  problems  we  are  interested  in  can  be 
defined  as  follows: 


Given:  a  set  E  of  training  examples,  described
in  terms  of  a  set  of  operational  predicates  P,
where  we  distinguish  symbolic  predicates
PS  and  numeric  predicates  (attributes)  PN.
Thus,  PS  ∪  PN  =  P.  Also  attached  to  each
example  e  ∈  E  is  a  numeric  attribute  T(e,v)
(the  target  attribute)  with  known  value  v.
(This  replaces  the  classification  in  symbol¬ 
ic  supervised  concept  learning.)  As  v  is  a 
function  of  (the  description  of)  the 
instance,  we  will  also  write  v  =  T(e).  Note 
that  so  far,  there  are  no  negative  instances  in 
this  scenario. 

Find:  a  set  of  general  rules  that  predict,  for  any 
given  object  o  described  by  predicates  ∈  P,
a  numeric  value  v  =  T(o),  based  on  the  de¬ 
scription  of  o. 

(As  in  symbolic  concept  learning,  we  might 
require  these  rules  to  be  complete  (predict  a 
value  for  every  example)  and  correct  (pre¬ 
dict  the  correct  value  for  each  example) 
with  respect  to  the  training  data  (cf.  Mi- 
chalski,  1983).  However,  this  may  not  be 
100%  desirable  or  feasible  in  every  ap¬ 
plication  domain.) 

In  addition,  there  may  be  some  domain-specific 
background  knowledge  (BK)  relating  the  target 
concept  T(X,V)  (or  some  abstractions  of  T;  see
below)  to  some  of  the  operational  predicates  P 
in  specific  ways,  possibly  via  some  intermediate 
non-operational  predicates.  This  knowledge 
might  be  in  the  form  of  rules,  as  in  standard  EBL 
domain  theories  (Mitchell  et  al.,  1986)  or  in  the 
form  of  qualitative  knowledge  items  as  in  (Wid- 
mer,  1993).  The  knowledge  need  not  be  correct 
or  complete,  nor  need  it  be  quantitative  and  pre¬ 
cise.  An  additional  constraint  then  is  to  find 
solutions  (rules)  that  conform  as  closely  as  pos¬ 
sible  to  BK  while  also  consistently  describing 
the  training  data  E. 


The  learning  model  we  are  going  to  introduce  in 
the  next  section  includes  a  symbolic  learning 
component  that  can  utilize  qualitative  back¬ 
ground  knowledge  for  generating  plausible  ex¬ 
planations.  For  this  method  to  be  applicable,  we 
need  to  make  the  following 

Assumptions: 

1)  We  assume  that  there  are  some  discrete,  qualitative
sub-concepts  Ti(X)  of  the  target  concept  T(X,V)
that  can  naturally  be  distinguished,  where  a
sub-concept  is  defined  by  a
more  or  less  clearly  distinguished  subrange 
of  the  function  value  V. 

2)  We  further  assume  that  it  is  these  discrete 
sub-concepts  that  are  related  to  operational 
predicates  P  by  the  available  background 
knowledge  BK. 

3)  Finally,  we  assume  that  examples  of  the  dis¬ 
crete  sub-concepts  Ti  can  be  distinguished 
using  the  operational  predicates  P. 

For  example,  in  our  energy  demand  prediction
task,  such  subconcepts  might  be
extremely_low(Demand)  or  higher_than_capacity(Demand);
in  the  musical  domain  described  below,
there  are  natural  qualitative  subconcepts  such  as
crescendo(Note)  and  diminuendo(Note)  (increase
or  decrease,  respectively,  in  loudness  relative
to  the  current  level)  or  accelerando(Note)
and  ritardando(Note)  (increase  or  decrease  in
tempo).

The  motivation  for  this  assumption  is  that  these
discrete,  qualitative,  symbolic  sub-concepts
will  be  the  target  concepts  for  the  symbolic
learning  component.  Each  of  the  original  training
instances  will  be  assigned  to  one  of  the  sub-concepts
Ti,  depending  on  its  value  v  =  T(e),  and
the  symbolic  learner  will  learn  general  rules  for
each  sub-concept.  Note  that  in  this  way  we  also
introduce  negative  instances  for  each  target  concept
Ti,  namely,  all  examples  assigned  to  some
Tj  where  j  ≠  i.

¹)  The  boundaries  between  these  subconcepts  will
sometimes  have  to  be  defined  somewhat  arbitrarily.
This  is  not  necessarily  a  problem,  as  the  results  of  the
symbolic  learning  component  are  not  used  for  classification,
but  only  to  find  appropriate  sets  of  instances  for
comparison.  Section  2.2  will  make  that  clearer.
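
The assignment of training instances to sub-concepts can be illustrated with a minimal sketch. It assumes that each sub-concept Ti is given as a half-open subrange [lo, hi) of the target value; the function names, the dictionary layout of an example, and the concrete boundary values are hypothetical and serve only as illustration.

```python
def label_subconcept(v, boundaries):
    """Return the name of the first subrange [lo, hi) that contains v, else None."""
    for name, (lo, hi) in boundaries.items():
        if lo <= v < hi:
            return name
    return None


def split_pos_neg(examples, boundaries):
    """For each sub-concept Ti, collect its positive and negative instances.

    Positives are the examples whose target value falls into the subrange
    of Ti; negatives are the examples assigned to some other sub-concept.
    """
    labelled = [(e, label_subconcept(e["target"], boundaries)) for e in examples]
    result = {}
    for name in boundaries:
        pos = [e for e, lab in labelled if lab == name]
        neg = [e for e, lab in labelled if lab is not None and lab != name]
        result[name] = (pos, neg)
    return result


# Hypothetical boundaries: a loudness factor below 1.0 counts as diminuendo,
# a factor of 1.0 or above as crescendo.
BOUNDARIES = {
    "diminuendo": (0.0, 1.0),
    "crescendo": (1.0, float("inf")),
}
```

Since the boundaries are only used to form groups of comparable instances, an arbitrary choice at the border between two subranges changes which instance space an example lands in, not the value stored with it.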


2.2  The  learning  model 

Returning  to  our  general  learning  problem,  one 
way  to  approach  it  would  be  to  simply  do 
instance-based  learning  in  the  entire  descrip¬ 
tion  space  spanned  by  all  the  available  attributes 
P,  symbolic  and  numeric.  That  is,  training 
instances  would  be  stored  along  with  their  com¬ 
plete  descriptions,  and  the  value  v  =  T(o)  for 
some  new  object  o  would  be  predicted  by  some 
nearest  neighbor  method  in  the  space  of  stored 
instances,  possibly  with  some  numeric  inter¬ 
polation.  There  are  several  problems  with  this 
approach.  First,  it  is  not  always  clear  how  to  de¬ 
vise  a  similarity  metric  that  combines  symbolic 
and  numeric  attributes  in  a  meaningful  way,  es¬ 
pecially  when  the  attributes  are  of  various  types 
and  inhomogeneous  with  respect  to  domain  size 
etc.  Second,  the  only  way  to  integrate  available 
qualitative  background  knowledge  into  the 
learning  process  is  via  the  similarity  metrics. 
This  may  not  be  the  most  natural  way  to  express 
one’s  domain  knowledge.  Moreover,  instance-based
approaches  suffer  from  the  problem  that
they  do  not  produce  comprehensible  concept 
descriptions.  On  the  other  hand,  it  is  clear  that 
some  kind  of  instance-based  interpolation  com¬ 
ponent  is  needed  in  such  domains,  as  the  task  is 
to  predict  numeric  values  from  continuous  do¬ 
mains,  which  is  impossible  with  discrete,  sym¬ 
bolic  concept  descriptions. 


The  model  we  are  proposing  here  consists  of
two  components:  a  symbolic  learning  component
that  learns  to  distinguish  different  types  of
situations  and  can  utilize  all  the  available  do¬ 
main  knowledge,  and  an  instance-based  com¬ 
ponent  which  stores  the  instances  with  their  pre¬ 
cise  numeric  attribute  values  and  can  predict  the 
target  value  for  some  new  object  by  numeric  in¬ 
terpolation  over  known  instances.  The  connec¬ 
tion  between  these  two  components  is  as  fol¬ 
lows:  each  rule  (conjunctive  hypothesis) 
learned  by  the  symbolic  learning  component  describes
a  subset  of  the  instances;  these  are  assumed
to  represent  one  particular  subtype  of  the
concept  to  be  learned.  All  the  instances  covered 
by  a  rule  are  given  to  the  instance-based  learner 
to  be  stored  together  in  a  separate  instance 
space.  Predicting  the  target  value  for  some  new 
object  then  involves  matching  the  object  against 
the  symbolic  rules  and  using  only  those  numeric 
instance  spaces  (interpolation  tables)  for  predic¬ 
tion  whose  associated  rules  are  satisfied  by  the 
object.  In  this  way,  the  system  learns  several  dis¬ 
tinct  instance  spaces  where  different  laws  and 
regularities  may  apply.  In  fact,  different 
instance  spaces  may  contain  examples  with  con¬ 
flicting  values. 
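
The connection between the two components can be sketched as follows. This is an illustration only: rules are modelled as plain Python predicates over an example, an example is a dictionary with hypothetical keys "symbolic", "numeric" and "target", and the function names are not taken from the implemented system.

```python
def build_instance_spaces(rules, examples):
    """Store, for each rule, the training examples covered by that rule.

    An example covered by several rules is stored in several spaces.
    """
    return {
        name: [e for e in examples if covers(e)]
        for name, covers in rules.items()
    }


def matching_spaces(rules, obj):
    """Return the names of the instance spaces whose rule is satisfied by obj."""
    return [name for name, covers in rules.items() if covers(obj)]


# Two hypothetical rules, one symbolic and one referring to a numeric attribute.
RULES = {
    "r1": lambda e: "followed_by_rest" in e["symbolic"],
    "r2": lambda e: e["numeric"].get("duration", 0.0) > 1.0,
}
```

Because each space is filled and queried independently, two spaces may hold instances with similar numeric descriptions but different target values without interfering with each other.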

More  precisely,  the  target  concepts  for  the  sym¬ 
bolic  learning  component  are  the  discrete,  qual¬ 
itative  sub-concepts  Ti  mentioned  in  section 
2.1.  The  symbolic  learner  learns  general  condi¬ 
tions  that  characterize  or  discriminate  between 
these  discrete  classes.  These  conditions  may  re¬ 
fer  to  both  symbolic  and  numeric  predicates. 
The  symbolic  learner  tries  to  use  all  the  avail¬ 
able  qualitative  background  knowledge.  The  re¬ 
sult  produced  by  this  component  is  a  set  of  gen¬ 
eral  rules  that  group  the  examples  into  clusters 
by  assigning  them  to  different  sub-classes  of  the 
target  concept. 
The  numeric,  instance-based  component  takes
the  original  training  instances  E  as  clustered  by 
the  symbolic  learner,  and  creates  a  separate 
instance  space  from  each  cluster.  Instances  are 
stored  with  all  their  numeric  attributes  and  with 
their  precise  numeric  target  values.  The  dimen¬ 
sions  of  such  an  instance  space  are  thus  defined 
by  the  numeric  attributes  PN.  For  some  new  ob¬ 
ject  o,  the  target  value  v  =  T(o)  can  then  be  pre¬ 
dicted  by  selecting  the  appropriate  instance 
space  (by  using  the  generated  symbolic  rules  as 
filters),  and  applying  some  numeric  interpolation
method  over  the  stored  instances.

Several  comments  seem  to  be  in  order  here: 
First,  we  assume  that  the  symbolic  learning 
component  may  refer  to  predicates  from  both
PS  and  PN  for  its  hypotheses.  It  is  not  realistic 
(nor  necessary)  to  expect  that  the  discrete  sub¬ 
concepts  Ti  can  always  be  distinguished  by  ref¬ 
erence  to  symbolic  predicates  only.  Second,  we 
do  assume  that  after  clustering  the  examples  ac¬ 
cording  to  sub-concepts  (and  sub-sub-con¬ 
cepts,  if  these  are  disjunctive),  interpolation 
over  the  numeric  attributes  in  the  resulting 
instance  spaces  is  sufficient  to  predict  sensible 
target  values.  In  other  words,  we  assume  that  the 
rules  learned  for  the  sub-concepts  Ti  contain  all 
the  relevant  symbolic  information.  The  dimen¬ 
sions  of  the  instance  spaces  are  only  attributes 
from  PN.  Any  other  solution  would  require 
some  non-standard  interpolation  scheme  to  ar¬ 
rive  at  numeric  prediction  values.  If  additional 
domain  knowledge  about  attribute  relevance  is 
available,  the  number  of  numeric  dimensions 
may  still  be  reduced,  or  some  specialized  simi¬ 
larity  measures  may  be  used  for  interpolation. 


3  Realization  of  the  Model: 
IBL-Smart 

The  general  method  has  been  implemented  in  a
system  named  IBL-Smart  and  has  been  tested  in
the  context  of  a  complex  musical  problem.  In
accordance  with  the  model,  IBL-Smart  consists
of  two  components.  The  first  of  these,  the
symbolic  learner,  has  been  specifically  designed
to  be  able  to  use  qualitative  domain
knowledge.

3.1  The  symbolic  learning  component

The  symbolic  learner  in  IBL-Smart  is  a  multi¬ 
ple-instance  plausible  explanation  system 
based  on  the  search  algorithm  of  ML-Smart 
(Bergadano  &  Giordana,  1988).  It  performs 
top-down  discrimination,  integrating  and  interleaving
deductive  and  inductive  operationalization
steps.  The  basics  of  the  search  are  described
below  (section  3.1.1).  For  IBL-Smart,  we  have
extended  ML-Smart’s  discrimination  algorithm
to  also  use  qualitative  background  knowledge
in  the  form  of  general  dependency  statements
and  directed  qualitative  dependency  relations.
This  is  described  in  section  3.1.2.

3.1.1  The  basic  search  algorithm 

The  learner  starts  with  a  nonoperational  defini¬ 
tion  of  the  target  concept  (some  discrete  sub¬ 
concept  Ti)  and  performs  stepwise  operational¬ 
ization  (specialization)  by  growing  a  heuristic 
best-first  search  tree.  Each  node/partial  hypoth¬ 
esis  in  the  search  tree  is  accompanied  by  its  ex¬ 
tension,  i.e.,  the  positive  and  negative  examples 
covered  by  the  operational  part  of  the  expression.
This  makes  it  possible  to  use  coverage
measures  as  part  of  the  search  heuristic.

As  in  ML-Smart,  each  step  in  the  search  is  ei¬ 
ther 

(1)  a  deductive  application  of  a  rule  from  the
domain  theory  —  replacing  a  non-opera- 
tional  literal  by  its  sufficient  conditions  as 
defined  by  the  rule; 

(2)  an  inductive  generalization  step  —  drop¬ 
ping  a  predicate  when  the  node  covers  too 
few  positive  instances  and  thus  the  hypoth¬ 
esis  seems  too  restricted;  or 

(3)  an  inductive  specialization  step  —  adding 
some  predicate  to  the  operational  part  of  the 
hypothesis  in  order  to  exclude  some  nega¬ 
tive  instances. 

Deductive  operationalization  steps  (1)  are  preferred.
Inductive  specialization  (3)  is  done  when
deductive  operationalization  is  not  possible 
(e.g.,  when  no  rule  is  available  or  applicable  to 
the  examples).  Inductive  generalization  (2)  is 
attempted  whenever  the  current  node  covers  too 
few  positive  instances  (according  to  some 
threshold)  and  thus  the  hypothesis  seems  too  re¬ 
strictive.  The  system  then  looks  for  a  condition 
that,  if  dropped,  would  increase  the  number  of 
instances  covered  by  the  hypothesis.  All  in  all, 
the  search  algorithm  integrates  deduction  and 
induction  in  a  fine-grained  manner. 
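
The preference among the three kinds of steps can be summarised in a small sketch. The text does not fix every detail, so the following are assumptions made for illustration: the coverage check for generalization is tested first, the search stops at a node that no longer covers negative instances, and a node is a dictionary with hypothetical keys "pos", "neg" and "conditions".

```python
def choose_step(node, applicable_rules, min_positives):
    """Pick the kind of search step to apply to a node.

    node["pos"] and node["neg"] hold the covered positive and negative
    instances; node["conditions"] holds the operational conditions so far.
    applicable_rules lists the domain-theory rules usable at this node.
    """
    # (2) Too few positives covered: the hypothesis seems too restrictive,
    #     so try to drop a condition (only possible if there is one).
    if len(node["pos"]) < min_positives and node["conditions"]:
        return "generalize"
    # (1) Deductive operationalization is preferred whenever a rule applies.
    if applicable_rules:
        return "deduce"
    # (3) No rule applies: add a predicate to exclude negative instances.
    if node["neg"]:
        return "specialize"
    # Nothing left to exclude (assumed stopping condition).
    return "stop"
```

In the actual search the heuristic H decides both which node to expand and how; the fixed ordering above is a deliberate simplification of that decision.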

The  search  is  guided  by  a  heuristic  measure 
(evaluation  function)  H  that  measures  the  relative
‘goodness’  of  nodes.  The  heuristic  decides
both  which  node  is  to  be  expanded  next,  and 
how.  Among  other  things  (see  below),  it  takes 
into  account  the  coverage  of  the  expression,  i.e., 
the  ratio  positive  /  negative  instances  covered  by 
the  node,  and  also  the  absolute  number  of  posi¬ 
tive  instances  covered. 


The  discrimination  algorithm  has  been  extended
to  utilize  also  numeric  attributes  in  discrimination
steps.  For  numeric  attributes,  the
system  looks  for  a  binary  split  point  that  best
discriminates  between  positive  and  negative
instances,  as  it  is  done  in  some  decision  tree
learners  (e.g.,  Cestnik  et  al.,  1987;  Fayyad  &
Irani,  1992).  The  general  evaluation  function  H
of  the  search  algorithm  is  used  to  determine
what  is  the  best  split.
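
The search for a binary split point might look as follows. The evaluation function H is not spelled out here, so this sketch substitutes a simple stand-in score (covered positives minus covered negatives); the candidate thresholds are the midpoints between adjacent distinct attribute values, as is common in decision tree learners. The function name and return format are hypothetical.

```python
def best_split(pos_values, neg_values):
    """Find a threshold on one numeric attribute that separates positives
    from negatives.

    Returns (threshold, side, score), where side is "<=" or ">" and tells
    which half is kept as the new condition, or None if the attribute has
    fewer than two distinct values.
    """
    values = sorted(set(pos_values) | set(neg_values))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        for side in ("<=", ">"):
            if side == "<=":
                p = sum(v <= t for v in pos_values)
                n = sum(v <= t for v in neg_values)
            else:
                p = sum(v > t for v in pos_values)
                n = sum(v > t for v in neg_values)
            score = p - n  # stand-in for the evaluation function H
            if best is None or score > best[2]:
                best = (t, side, score)
    return best
```

For positive values 3, 4, 5 and negative values 1, 2, the condition "attribute > 2.5" covers all positives and no negatives and is therefore selected.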

3.1.2  Using  qualitative  background
knowledge 

The  search  algorithm  as  described  above  corre¬ 
sponds  closely  to  the  original  ML-Smart  meth¬ 
od  as  presented  in  (Bergadano  &  Giordana, 
1988).  In  our  system  IBL-Smart,  the  algorithm
has  been  extended  so  as  to  also  utilize  qualitative
background  knowledge,  where  available.
IBL-Smart  domain  theories  may  contain  two 
types  of  qualitative  knowledge  items: 

(1)  General  dependency  statements  of  the  form
depends_on(Q,Ps)  simply  state  that  some 
predicate  Q  may  be  operationalized  by  us¬ 
ing  a  set  of  specified  predicates  Ps.  This 
type  of  general  knowledge  items  has  al¬ 
ready  been  proposed  in  (Bergadano,  Gior¬ 
dana  and  Ponsero,  1989).  In  IBL-Smart, 
such  statements  tell  the  search  algorithm  to 
use  an  entire  set  of  predicates  in  one  opera¬ 
tionalization  step;  successors  of  a  node  are 
created  for  all  possible  combinations  of 
values  for  the  predicates  Ps  occurring  in 
some  positive  instances  covered  by  the 
node. 

Such  dependency  statements  are  similar, 
but  not  identical,  to  determinations  (Rus¬ 
sell,  1987).  They  permit  IBL-Smart  to  per¬ 
form  strictly  constrained  forms  of  look¬ 
ahead,  and  thus  help  overcome  blindness
effects  that  would  arise  if  the  algorithm  performed
purely  empirical  step-wise  specialization.
For  instance,  they  can  be  used  to
describe  relational  clichés  as  proposed  in
(Silverstein  &  Pazzani,  1991). 

(2)  Directed  dependency  statements  of  the 
form  q+(A,B)  can  be  paraphrased  as  “the 
values  of  A  and  B  are  positively  proportionally
related”  or  “high  (or  low)  values  of
A  tend  to  produce  high  (or  low)  values  of  B,
all  other  things  being  equal”.  Negative  dependency
(q-(A,B))  is  defined  analogously.
Such  statements  are,  of  course,  restricted
to  functional  predicates  (or  attributes)
that  assign  values  to  objects.  They
were  already  used  in  (Widmer,  1991)  and
are  similar  to  Michalski’s  M-descriptors 
(Michalski,  1983).  The  notation  was  bor¬ 
rowed  from  Forbus’  qualitative  propor¬ 
tionalities  (Forbus,  1984). 

In  the  search  algorithm  of  IBL-Smart,  di¬ 
rected  dependencies  are  used  like  general 
dependencies  (create  successors  for  all  pos¬ 
sible  value  combinations),  and  the  addi¬ 
tional  knowledge  about  the  direction  of  in¬ 
fluence  is  used  in  the  search  heuristic  H: 
when  evaluating  some  operationalization 
based  on  a  q+  or  q-  relation,  the  heuristic 
also  rates  the  degree  to  which  the  particular 
values  involved  match  the  direction  of  the 
dependency  (which  is  assumed  to  be  linear).
Knowing  that  q+(A,B)  and  operationalizing
condition  B(.)  =  b,  an  operationalization
B(.)  =  b  because  A(.)  =  a  for  specific
values  b  and  a  will  be  regarded  the  more
plausible  the  more  the  relative  positions  of
b  and  a  in  their  respective  domains  agree:
B(.)  =  high  because  A(.)  =  high  is  rated  as
more  plausible  than  B(.)  =  high  because
A(.)  =  low  (see  also  Widmer,  1993).  Hypotheses
constructed  by  IBL-Smart  will
tend  to  include  those  attributes  that  most 
closely  approximate  such  linear  constraints 
between  the  data  and  the  background 
knowledge. 
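
The use of both kinds of dependency statements can be sketched in a few lines. The first function enumerates the value combinations that give rise to successor nodes; the second rates how well a pair of values agrees with a q+ or q- relation. The exact plausibility measure is not given here, so the formula below (one minus the difference of the relative positions of the two values in their domains) is an assumption chosen to reflect the linearity assumption, and all names are hypothetical.

```python
def successor_value_combinations(pos_instances, predicates):
    """Value combinations for `predicates` that occur in covered positives.

    Each instance is a dict mapping predicate names to values; instances
    lacking one of the predicates contribute no combination.
    """
    combos = set()
    for inst in pos_instances:
        if all(p in inst for p in predicates):
            combos.add(tuple(inst[p] for p in predicates))
    return sorted(combos)


def relative_position(value, lo, hi):
    """Position of value within [lo, hi], scaled to the interval [0, 1]."""
    if hi == lo:
        return 0.5
    return (value - lo) / (hi - lo)


def plausibility(a, a_range, b, b_range, sign=+1):
    """Rate 'B = b because A = a' under q+(A,B) (sign=+1) or q-(A,B) (sign=-1).

    The score is 1.0 when the relative positions of a and b agree exactly
    with the direction of the dependency, and 0.0 when they are opposite.
    """
    ra = relative_position(a, *a_range)
    rb = relative_position(b, *b_range)
    if sign < 0:
        ra = 1.0 - ra  # a negative dependency reverses the expected position
    return 1.0 - abs(ra - rb)
```

With this measure, "B = high because A = high" receives a higher score under q+(A,B) than "B = high because A = low", and the ordering is reversed under q-(A,B).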

Note  that,  as  with  strict  deductive  rules,  such
qualitative  dependency  statements  need  not
necessarily  be  entirely  correct  in  order  to  have  a
positive  impact  on  the  search.  If  a  dependency
statement  is  correct,  it  will  lead  to  faster  convergence;
if  it  is  too  general  (the  given  predicates
are  not  sufficient  to  completely  discriminate  between
positive  and  negative  instances),  subsequent
empirical  discrimination  steps  will  refine
it.  And  if  it  is  overly  restrictive  (some  predicates
are  not  necessary),  this  may  be  repaired  by  empirical
generalization  steps,  where  the  predicates
that  are  too  restrictive  are  removed  from
the  hypothesis  to  arrive  at  a  more  general  partial 
concept  description. 

By  taking  into  account  both  such  inference-de¬ 
pendent  plausibility  measures  and  information 
about  the  numbers  of  positive  and  negative 
instances  covered  by  a  node,  the  search  heuristic
combines  weak,  imprecise  background  knowledge
with  empirical  information  from  the  training
data,  producing  hypotheses  that  tend  to  correspond
to  the  background  knowledge  as  much
as  the  data  permit  and  overriding  the  background
knowledge  if  the  data  are  in  conflict  with
the  knowledge.

3.2  The  numeric  instance-based  component 

The  result  of  this  learning  step  is  a  concept  hy¬ 
pothesis  for  a  discrete,  qualitative  sub-concept 
in  the  form  of  a  DNF  expression,  where  each 
conjunct  describes  one  particular  subtype  of  the 
sub-concept.  The  instance-based  learner  now 
collects  all  the  training  instances  covered  by  a 
particular  conjunct  and  builds  an  instance  store 
in  the  form  of  an  interpolation  table,  using  these
examples.  In  the  absence  of  knowledge  about
the  relevance  of  the  numeric  attributes  (∈  PN)  to
the  target  value,  the  dimensions  of  the  interpolation
table  are  chosen  to  be  all  the  numeric  attributes
(⊆  PN)  shared  by  the  selected  instances
(not  all  instances  may  have  defined  values  for  all
attributes),  and  the  output  dimension  is  the  value
of  the  target  variable  V  =  T(X).

Fig.  1:  Sketch  of  IBL-Smart

When  given  a  new  instance  for  which  to  predict 
the  value  of  the  target  variable,  the  system 
matches  the  instance  against  all  learned  rules, 
retrieves  those  instance  spaces  whose  associated 
rules  are  matched,  and  computes  a  value  for  the 
instance’s  target  value  by  interpolation  in  each 
of  the  retrieved  spaces.  If  the  instance  matches 
more  than  one  rule,  and  thus  target  values  are 
computed  in  several  spaces,  the  target  values  are 
simply  averaged.  Lacking  more  specific  knowl¬ 
edge  about  the  relationships  between  the  vari¬ 
ous  numeric  parameters,  we  use  the  Euclidean 
distance  as  the  similarity  measure  and  perform 
linear  interpolation.  Figure  1  summarizes  the 
basic  structure  of  IBL-Smart. 
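
The prediction step can be illustrated with a short sketch. The concrete interpolation scheme is not detailed here, so the code uses an inverse-distance weighted average of the two nearest stored instances as a stand-in; for a single numeric attribute and a query lying between its two neighbours this coincides with linear interpolation. Rules are modelled as Python predicates, distances are computed over the attributes that the query and the stored instance share, and all names and the dictionary layout are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance over the numeric attributes defined in both a and b."""
    shared = set(a) & set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))


def interpolate(space, query, k=2):
    """Predict a target value from the k nearest instances in one space."""
    nearest = sorted(space, key=lambda inst: distance(inst["numeric"], query))[:k]
    weighted = []
    for inst in nearest:
        d = distance(inst["numeric"], query)
        if d == 0.0:
            return inst["target"]  # exact match: reuse the stored value
        weighted.append((1.0 / d, inst["target"]))
    total = sum(w for w, _ in weighted)
    return sum(w * t for w, t in weighted) / total


def predict(rules, spaces, obj):
    """Average the values interpolated in all spaces whose rule matches obj.

    Returns None if no rule matches or all matching spaces are empty.
    """
    values = [
        interpolate(spaces[name], obj["numeric"])
        for name, covers in rules.items()
        if covers(obj) and spaces[name]
    ]
    return sum(values) / len(values) if values else None
```

Note that a stored instance sharing no attribute with the query has distance zero in this sketch and is treated as an exact match; a fuller implementation would have to handle that case explicitly.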


4  An  Application  of  IBL-Smart: 
Learning  Expressive  Interpretation 

IBL-Smart  has  been  applied  to  a  complex  problem
from  the  domain  of  tonal  music,  namely,  expressive
performance  or  interpretation  of  written
music.  By  this  we  understand  the  variations
in  tempo  and  loudness  that  a  performer  applies 
(consciously  or  unconsciously)  to  the  notes  of  a 
piece  during  performance.  When  played  exactly 
as  written,  most  pieces  of  tonal  music  would 
sound  utterly  mechanical  and  lifeless. 

There  are  basically  three  dimensions  to  expressive
performance:  variations  in  tempo  (“rubato”),
in  loudness  (“dynamics”)  and  in  the  duration
of  notes  as  actually  played,  as  opposed  to
the  notated  length  (“articulation”).

In  this  presentation,  we  will  restrict  ourselves  to 
the  dimension  of  dynamics.  As  mentioned  in  the 
introduction,  this  concept  is  inherently  numeric, 
as  the  task  is  to  decide  not  just  whether  or  not  to 
play  some  note  louder  or  softer,  but  exactly  by 
how  much.  Nevertheless,  there  are  two  discrete, 
qualitative  sub-concepts  that  can  naturally  be 
distinguished:  crescendo(Note)  and  diminuendo(Note),
i.e.,  whether  a  note  is  to  be  played  louder
or  softer,  respectively,  than  some  standard  level.
These  are  the  target  concepts  for  the  plausible
explanation  component.  The  precise  amounts
by  which  the  loudness  is  to  be  varied  are  numeric
multiplication  factors  that  are  to  be  learned  by
the  instance-based  component.

Training  instances  are  derived  from  actual  per¬ 
formances  of  piano  pieces  recorded  on  an  elec¬ 
tronic  piano  via  a  MIDI  interface.  At  the  mo¬ 
ment  we  are  restricting  ourselves  to  single  line 
melodies  (with  additional  information  about  the
underlying  harmonic  structure  of  the  piece). 
That  is,  the  input  is  a  sequence  of  notes,  de¬ 
scribed  in  terms  of  various  predicates  and  ac¬ 
companied  by  explicit  information  about  the  de¬ 
gree  of  crescendo  or  diminuendo  that  was  ap¬ 
plied  to  it  by  the  performer.  Each  note  of  a 
played  piece  is  a  training  instance. 

The  description  language  consists  of  predicates 
that  describe  various  features  of  a  note  and 
structural  features  of  its  surroundings.  There  are 
currently  41  operational  predicates,  of  which  21 
are  symbolic  (like  followed_by_rest(Note))  and 
20  are  numeric  (like  duration(Note,X)).  Some  of
these  predicates  are  computed  by  a  pre-processing
component  which  performs  a  music-theoretic
analysis  of  the  given  piece  in  terms  of  some
relevant  musical  structures  (e.g.,  phrases  and
various  types  of  ‘processes’  such  as  linear  melodic
lines  (ascending  or  descending),  rhythmic
patterns,  etc.).  Many  numeric  attributes  then  describe
the  relative  position  of  a  note  in  a  phrase
or  in  a  ‘process’.  Note  that  the  number  of  attributes
defined  for  a  given  note  varies:  some  notes
occur  in  many  patterns,  others  only  in  some.  So
not  all  numeric  attributes  are  defined  for  every
note.

A  clarifying  remark  to  readers  who  feel  that  we  are
trivializing  the  artistic  phenomenon  of  expressive  musical
performance  by  claiming  that  a  computer  program
can  easily  learn  to  replicate  such  behaviour,  or
that  these  phenomena  can  be  explained  by  some  simple
domain  theory:  We  are  not  talking  here  about  the  highly
artistic  details  in  variation  that  distinguish  a  great  pianist
or  other  performer.  We  are  convinced,  however
(and  there  is  much  support  for  this  hypothesis  from  various
areas  of  musicology),  that  expressive  performance
does  have  a  large  ‘rational’  component,  in  that  one  of
its  purposes  is  to  convey  an  understanding  of  musical
structure  to  a  listener.  It  is  this  rational  part  for  which
we  can  find  partial  plausible  explanations  and  which
we  can  expect  a  computer  to  learn,  provided  it  is
equipped  with  the  necessary  musical  knowledge  and  a
suitable  vocabulary.

The  background  knowledge  for  this  problem  is 
mainly  in  the  form  of  directed  and  undirected 
dependency  statements.  The  domain  theory  is  a 
hierarchy  of  such  dependency  statements  and 
some  crisp  rules.  The  top  level  of  the  theory  re¬ 
lates  the  phenomenon  of  loudness  variations  to 
some  abstract  musical  notions  by  a  set  of  depen¬ 
dencies  like 

depends_on( crescendo(Note,X),
            [ salience(Note,Y)]).
depends_on( crescendo(Note,X),
            [ goal_directedness(Note,Y)]).
depends_on( crescendo(Note,X),
            [ closure(Note,Y)]).

The  first  of  these  can  be  paraphrased  as 

“Whether crescendo should be applied to a note
(and if so, the exact amount X) depends, among
other things, on the structural importance (salience)
Y of the note.”

and  analogously  for  the  other  ones. 

The  abstract  notions  salience,  goal_directed- 
ness,  and  closure  are  then  again  related  to  low¬ 
er-level  musical  effects,  all  the  way  down  to 
some  surface  features  of  training  instances,  for 
example: 
Fig. 2: Beginnings of three little minuets by J.S. Bach


q+( metrical_strength(Note,X),
    stability(Note,Y))  and

q+( harmonic_stability(Note,X),
    stability(Note,Y))

“The  degree  of  stability  Y  of  a  note  is  positively 
proportionally  related  (among  other  things)  to 
the  metrical  strength  X  of  the  note”  etc. 

where metrical_strength is a numeric and
harmonic_stability is a symbolic attribute (with a
discrete, ordered domain of qualitative values).
Both  are  defined  as  operational. 

Given  this  domain  theory  and  some  played 
pieces,  the  plausible  explanation  component 
learns  mixed  symbolic/numeric  rules  that  dis¬ 
criminate  various  types  of  situations  where  a 
crescendo  or  a  diminuendo  occurs.  These  rules 
are  sets  (disjunctions)  of  conjunctive  condi¬ 
tions;  each  conjunct  describes  a  particular  class 
of crescendo/diminuendo situations. For each
conjunct,  a  numeric  interpolation  table  (in¬ 
stance  space)  is  created  which  contains  all  the 
instances  covered  by  the  conjunct.  The  set  of  all 
numeric  attributes  shared  by  all  the  instances 


covered  by  a  conjunct  defines  the  dimensions  of 
the  respective  interpolation  table. 
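
To make the construction of an interpolation table concrete, the following is a minimal Python sketch, not the IBL-Smart implementation. It assumes that each instance covered by a conjunct is given as a dictionary of its defined numeric attributes together with the observed loudness factor. The names build_table and predict, the example attribute phrase_pos, and the nearest-neighbour lookup are illustrative assumptions; the text does not say how a value is read out of a table.

```python
from math import sqrt

def build_table(instances):
    """Build one interpolation table from the instances covered by a conjunct.

    Each instance is a pair (numeric_attrs, factor). The table dimensions are
    the numeric attributes that are defined for every covered instance.
    """
    dims = sorted(set.intersection(*(set(attrs) for attrs, _ in instances)))
    rows = [([attrs[d] for d in dims], factor) for attrs, factor in instances]
    return dims, rows

def predict(table, query):
    """Return the factor of the closest stored instance (assumed lookup rule)."""
    dims, rows = table
    point = [query[d] for d in dims]

    def dist(row):
        return sqrt(sum((a - b) ** 2 for a, b in zip(row[0], point)))

    return min(rows, key=dist)[1]

# Two hypothetical crescendo instances; only the shared attributes survive.
covered = [
    ({"metrical_strength": 5.0, "duration": 1.0, "phrase_pos": 0.2}, 1.15),
    ({"metrical_strength": 4.5, "duration": 0.5}, 1.05),
]
table = build_table(covered)
```

Because phrase_pos is missing from the second instance, it is dropped from the table dimensions, which mirrors the rule that only attributes shared by all covered instances are used.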

5  An  Experiment 

Several  experiments  with  comparatively  simple 
piano  pieces  have  been  performed.  In  one  ex¬ 
periment,  we  chose  three  well-known  minuets 
from J.S. Bach’s Notenbüchlein für Anna Magdalena
Bach as training and test pieces. The beginnings
of the three minuets are shown in Figure 2.
All three pieces consist of two parts. The
second  part  of  each  piece  was  used  for  training; 
they  were  played  on  an  electronic  piano  by  the 
author,  and  recorded  through  a  MIDI  interface. 
After  learning,  the  system  was  tested  on  the  first 
parts of the same pieces. In this way, we combined
some variation in the training data (three
different  pieces)  with  some  uniformity  in  style 
(three  pieces  from  the  same  period  and  with 
similar  characteristics;  test  data  from  the  same 
pieces  as  training  data,  though  different). 

The training input consisted of 212 examples
(notes), of which 79 were examples of crescendo
and 120 were examples of diminuendo (the
rest were played in a neutral way). The system
learned 14 rules (conjuncts) and, correspondingly,
14 interpolation tables characterizing crescendo
situations, and 15 rules for diminuendo.
Quite  a  number  of  instances  were  covered  by 
more  than  one  rule.  For  illustration,  here  is  a 
simple  rule  for  crescendo: 

crescendo(Note,X) :-
    metrical_strength(Note,S),
    S > 4.0,
    harmony_stability(Note,high),
    previous_interval(Note,I1),
    direction(I1,up),
    next_interval(Note,I2),
    direction(I2,down).

“Apply  some  crescendo  to  the  current  note  if 
the  metrical  strength  of  the  note  is  >  4 
and  the  underlying  harmony  is  stable 
and  the  direction  of  the  melody  from 
the  previous  to  the  current  note  is  up 
and  the  direction  of  the  melody  from  the 
current  note  to  the  next  is  down” 

The  quality  of  the  learning  results  is  not  easy  to 
measure,  as  there  is  no  precise  criterion  to  de¬ 
cide  whether  some  performance  is  right  or 
wrong.  Judging  the  correctness  is  a  matter  of  lis¬ 
tening.  Unfortunately,  we  cannot  attach  a  re¬ 
cording  to  this  paper  so  that  the  reader  can  ap¬ 
preciate  the  results.  Instead,  Figure  3  depicts  a 
part of one of the training pieces (the second part
of  the  first  minuet  in  G  major  as  played  by  the 
author),  and  also  shows  the  performance  created 
by  the  system  for  a  test  piece  (the  first  part  of  the 
same  minuet)  after  learning.  The  figures  plot  the 
relative  loudness  with  which  the  individual 
notes  were  played.  A  level  of  1.0  would  be  neu¬ 
tral,  values  above  1.0  represent  crescendo  (in¬ 
creased  loudness),  values  below  1.0  diminuen¬ 
do. 


The  reader  familiar  with  standard  music  nota¬ 
tion  may  appreciate  that  there  are  strong  similar¬ 
ities  in  the  way  similar  types  of  phrases  are 
played  by  the  human  teacher  and  the  learner. 
(Note,  for  instance,  the  crescendo  in  lines  rising 
by  stepwise  motion,  and  the  decrescendo  pat¬ 
terns  in  measures  with  three  quarter  notes). 
Generally,  the  results  were  very  good,  given  the 
limited  amount  of  training  data  and  the  surface 
differences  between  training  and  test  pieces. 
Readers not familiar with music notation will
have  to  take  our  word  for  it.  We  are  planning  ex¬ 
periments  with  other,  non-musical  domains 
where  the  results  will  be  more  easily  interpret- 
able  and  testable. 

In  a  comparative  experiment,  we  also  tested  a 
system  restricted  to  learning  only  in  an 
instance-based  way,  that  is,  with  interpolation 
tables,  but  without  the  symbolic  explanation 
component.  This  learner  used  all  the  available 
attributes,  both  numeric  and  symbolic.  The  fol¬ 
lowing  distance  metric  was  used:  all  numerical 
attributes  were  scaled  between  0  and  1,  and  for 
symbolic  attributes,  the  distance  was  defined  to 
be  0  in  the  case  of  a  match  and  1  otherwise.  As 
not  all  training  instances  share  all  numeric  di¬ 
mensions,  the  system  learned  as  many  inter¬ 
polation  tables  as  there  were  combinations  of 
numeric  attributes  occurring  in  the  training  data 
(18  for  crescendo,  12  for  decrescendo).  The  re¬ 
sults  on  the  same  data  were  considerably  worse. 
The  learner  did  not  distinguish  as  well  between 
different  types  of  situations,  and  the  results  are 
rather  blurred,  as  can  also  be  seen  from  Figure  4, 
which  shows  the  same  test  piece  as  played  by  the 
second  system. 
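
The distance metric of this baseline learner can be sketched as follows. This is an illustrative Python sketch under stated assumptions: instances are attribute dictionaries, numeric_ranges (a hypothetical parameter) holds the minimum and maximum of each numeric attribute in the training data, and the per-attribute distances are simply summed. The text specifies only the scaling to [0, 1] and the 0/1 rule for symbolic attributes, not how the terms are combined.

```python
def mixed_distance(a, b, numeric_ranges):
    """Distance between two instances given as attribute dictionaries.

    Numeric attributes are scaled to [0, 1] using their training range;
    symbolic attributes contribute 0 on a match and 1 otherwise.
    Summing the contributions is an assumption of this sketch.
    """
    total = 0.0
    for name in a.keys() & b.keys():
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            span = (hi - lo) or 1.0  # guard against constant attributes
            total += abs(a[name] - b[name]) / span
        else:
            total += 0.0 if a[name] == b[name] else 1.0
    return total

ranges = {"duration": (0.0, 2.0)}
n1 = {"duration": 1.0, "followed_by_rest": True}
n2 = {"duration": 0.5, "followed_by_rest": False}
d12 = mixed_distance(n1, n2, ranges)  # 0.25 (numeric) + 1 (symbolic mismatch)
```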


Fig. 3: Parts of a training piece as played by teacher (top) and
test piece as played by learner after learning (bottom)

Fig. 4: Part of test piece as played after instance-based learning only

6 Discussion, Related Work, and Related Matters

First,  let  us  briefly  recapitulate  the  main  charac¬ 
teristics  of  the  learning  model:  (1)  The  model 
can  learn  precise  numeric  concepts  via  an 
instance-based  method  while  using  available 
qualitative  background  knowledge  through  a 
symbolic learning component. (2) The symbolic
learning component defines and separates different
independent regions in numeric instance
space  where  different  regularities  may  apply. 
This  allows  the  instance-based  learner  to  build 
specialized  instance  stores,  which  may  yield 
very  specific  prediction  behaviour.  And  (3),  as  a 
side effect, learning rules for discrete sub-concepts
clusters the examples around meaningful
abstractions,  which  may  be  useful  for  other 
tasks. 

The  definition  of  abstract  sub-concepts  Ti 
introduces  a  natural  distinction  between  sym¬ 
bolic  and  numeric  learning,  and  also  produces 
negative  instances  for  the  symbolic  learner. 
That  the  background  knowledge  is  used  only  by 
the  symbolic  component  seems  natural,  given 
that  it  is  qualitative  and  thus  may  explain  ab¬ 
stract,  symbolic  concepts  (at  best),  but  certainly 
not  precise  numeric  values  and  relationships. 
Of  course,  this  does  not  preclude  the  use  of  addi¬ 
tional  knowledge  to  guide  or  constrain  numeric 
learning  in  the  instance-based  component. 


For  instance,  in  music,  we  may  be  able  to  explain  why 
a  performer  applied  some  crescendo  at  a  certain  point 
(for instance, in order to stress a musically important
event), but we can never explain why she chose exactly
that precise degree of crescendo. A system can only record
these precise degrees and try to replicate the same
behaviour in similar situations. What is similar is determined
by the rules learned by the symbolic component.


It  should  be  remembered  that  this  is  a  general 
learning  model;  the  system  presented  here  — 
IBL-Smart  —  is  just  one  particular  incarnation 
of  a  more  general  approach.  We  have  found  it 
convenient  to  use  a  best-first  search  algorithm 
like  the  ML-Smart  learner  as  the  basis  for  our 
plausible  explanation  component,  as  it  explicit¬ 
ly  constructs  a  search  tree  and  allows  us  to  inte¬ 
grate  various  sources  of  knowledge  into  the 
learning  process  via  the  search  heuristic  (evalu¬ 
ation  function).  However,  with  appropriate 
modifications and extensions, other symbolic
learners capable of utilizing incomplete and inconsistent
knowledge (for instance, FOCL; Pazzani
& Kibler, 1992) might be used just
as  well  in  this  framework. 

Similarly,  more  elaborate  strategies  could  be 
used in the instance-based component. Aha et
al. (1991) have described a number of instance-based
learning methods that could be applied
within a framework such as ours. Also, available
domain knowledge about the relative degree
of relevance of numeric attributes or about
the domains and typical values of numeric variables
could be used to devise more sophisticated
similarity metrics, tailored to the particular
application.

With respect to related work, we acknowledge
the important influence on this project by some
of the ideas expressed in (DeJong, 1989). DeJong
had presented a system that combined a
very weak notion of plausible inference over
single cases with numeric variables. Our approach
departs from his, among other things, in
the variety of types of background knowledge
and in the use of a heuristically guided, search-based,
multi-instance explanation algorithm
that allows much more control over the learning
process. Not only does this search introduce a
strong notion of empirical plausibility by taking
into account the distribution of instances; the
use of an explicit search heuristic also makes it
possible to exploit the qualitative knowledge
contained in qualitative dependencies (q+, q-)
to compute the relative plausibility of arguments.
The best-first search is very likely to find
explanations that are most plausible overall
(both with respect to the knowledge and the
data). DeJong’s system, on the other hand, simply
assumed that the syntactically simplest explanation
was also the most plausible one.

As an additional advantage of this multi-instance
explanation approach, we note also that
there is a natural way to deal with certain types
of noise in the training data. The evaluation
function of the search algorithm incorporates
two thresholds: it accepts only nodes (conjuncts)
that cover some minimum number of
positive instances, and the termination criterion
allows the search to halt when a certain percentage
(< 100%) of positive instances are covered.
Thus the system can ignore rare instances that
look like exceptions, but are really the result of
noise. By varying these thresholds, the system
can be tuned to the characteristics of different
application domains.

In  fact,  the  musical  experiments  described  in  the 
previous  section  were  characterized  very 
strongly  by  noise  in  the  data,  originating  from 
the  author’s  imperfect  piano  technique,  from  the 
imprecise  boundaries  between  the  abstract  sub¬ 
concepts  crescendo  and  diminuendo,  and  from 
imprecision  inherent  in  the  domain  itself  (there 
are  simply  no  100%  laws  as  to  how  some  pas¬ 
sage  must  and  will  be  played;  variation  will  in¬ 
variably  happen).  The  system  concentrated  on 
learning  typical  variations. 

Of  course  (and  this  is  also  implied  by  the  name), 
our system also owes a lot to the work on integrated
deductive-inductive learning in ML-Smart
(Bergadano & Giordana, 1988). We have
extended  the  ML-Smart  algorithm  to  also  uti¬ 
lize  background  knowledge  in  the  form  of  di¬ 
rected  dependency  statements  (which  are  a  very 
natural  kind  of  knowledge  in  many  domains). 

With  respect  to  the  system  described  in  (Wid- 
mer,  1991;  1993),  which  also  constructs  plausi¬ 
ble  explanations  of  individual  training  instances 
on  the  basis  of  qualitative  background  knowl¬ 
edge,  we  note  that  explaining  multiple  instances 
at  a  time  adds  a  strong  empirical  justification  to 
plausible explanations. The price is non-incrementality.
However, it is likely that, using techniques
described in (Widmer, 1989), IBL-Smart
can be made to learn incrementally without
losing too much in effectiveness.
Abstract

This  paper  introduces  a  technique 
for  using  the  randomized  nature  of 
some  learning  algorithms  to  increase 
their accuracy. Our method is to generate
multiple classifiers and combine
them  with  a  majority  voting  scheme. 

The  purpose  of  this  technique  is  to 
overcome  small  errors  that  appear  in 
individual  classifiers.  We  have  tested 
our  idea  on  a  type  of  randomized  de¬ 
cision  tree  with  real  data,  and  found 
that  it  consistently  improves  the  ac¬ 
curacy  over  that  of  average  trees.  We 
have  also  shown  that  this  technique 
outperforms  some  other  methods  that 
attempt  to  improve  accuracy  by  using 
randomization  in  a  different  way. 

1  Introduction 

Decision  trees  have  been  used  successfully  for 
many  different  decision  making  and  classifi¬ 
cation  tasks.  A  number  of  standard  tech¬ 
niques  have  been  developed  in  the  machine 
learning  community,  most  notably  Quinlan’s 
ID3 algorithm (1986) and Breiman et al.’s
CART algorithm (1984). Since the introduction
of these algorithms, numerous variations
and improvements have been put forward,
including new pruning strategies (e.g.,
Quinlan, 1987) and incremental versions of
the  algorithms  (Utgoff,  1989).  Many  of  these 
refinements  have  been  designed  to  produce 
better  decision  trees;  i.e.,  trees  that  were 
either  more  accurate  classifiers,  or  smaller 
trees,  or  both. 

The  main  goal  of  our  research  is  to  pro¬ 
duce  classifiers  that  provide  the  most  accu¬ 
rate  model  possible  for  a  set  of  data.  To 
achieve  our  goal,  we  have  combined  a  stan¬ 
dard  method  for  classification  -  decision  trees 
-  with  two  other  ideas.  The  first  idea  is  ran¬ 
domization,  which  in  this  context  allows  us 
to  generate  many  different  trees  for  the  same 
task.  The  second  idea  is  majority  voting, 
which is used, e.g., by k-nearest-neighbor
methods  to  decide  on  a  classification.  Here 
we  use  a  majority  vote  of  k  decision  trees  to 
classify  examples. 

2  Randomization  in  Learning 
Algorithms 

In a previous work [Heath, 1992], we introduced
the SADT learning algorithm (described
below). In that work, we explored
the generation of decision trees comprised of
tests that are linear inequalities over the attributes
(oblique decision trees). This is a
generalization of standard decision tree techniques,
in which each node of a tree is a
test of a single attribute. We showed that,
when generating oblique trees, finding even
a single test that minimizes some goodness
criterion is an NP-hard problem. We then
turned to the optimization technique of simulated
annealing to find good tests, which
should generate good (i.e., small and accurate)
trees.

Using simulated annealing in our learning algorithm
introduces an element of randomness.
Each time our SADT program is run,
it generates different trees. This led us to
explore methods of using this randomization
to our advantage by generating many trees
and using an additional criterion to choose the
best tree. Our argument was that picking
a good tree out of the many solutions produced
by a randomized algorithm may be
preferable to using an algorithm, even a very
clever one, that only produces one solution.

In  this  paper,  we  explore  another  way  of  us¬ 
ing  randomization  to  advantage.  As  before, 
we  use  a  single  training  set  to  generate  a  set 
of  classifiers.  Instead  of  choosing  one  repre¬ 
sentative  tree,  we  attempt  to  combine  the 
knowledge  represented  in  each  tree  into  a 
new,  more  accurate,  classifier. 

Specifically, we take a set of classifiers and
combine their classifications by taking the
plurality. In binary classification problems,
this reduces to taking the majority. For example,
if we have 5 trees, and 3 classify an
example as “0,” and the other two classify
it as “1,” then we predict the example belongs
to class “0.” When this technique is
applied to decision trees, we call the resulting
algorithm k-DT, in the spirit of k-NN,
the k-nearest-neighbor algorithm.
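
The voting step itself is small enough to sketch directly. The following Python fragment is an illustration, not the authors' code: it assumes each tree is any callable that maps an example to a class label, and the name k_dt_classify is hypothetical. Ties are broken by whichever label was counted first, which is an arbitrary choice the text does not address.

```python
from collections import Counter

def k_dt_classify(trees, example):
    """Return the plurality label among the predictions of the given trees."""
    votes = Counter(tree(example) for tree in trees)
    return votes.most_common(1)[0][0]

# Five stand-in "trees": three vote for class "0", two for class "1".
trees = [lambda e: "0", lambda e: "0", lambda e: "0",
         lambda e: "1", lambda e: "1"]
label = k_dt_classify(trees, example=None)
```

With these five stand-in classifiers the example is assigned to class "0", matching the 3-versus-2 case described in the text.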


2.1  The  advantage  of  majority 
voting 

The premise behind this idea is that any
one tree may not capture the target concept
completely accurately, but will approximate
it with some error. This error differs from
tree to tree. By using several trees and taking
the majority, we hope to overcome this
type of error. Consider, for example, a test
example x with probability p(x) of being correctly
classified by a random two-category
SADT tree. If we take the majority vote of
k trees, the probability that x is correctly
classified is

$$\mathrm{maj}(k,x) = \sum_{k/2 < j \le k} \binom{k}{j}\, p(x)^{j}\, (1 - p(x))^{k-j}$$

In this equation, j represents the number of
trees that correctly classify example x. We
require that it be more than half of the k
trees, thus the restrictions on the sum. $p(x)^{j}$
represents the probability of j trees getting
the example correct; $(1 - p(x))^{k-j}$ is the
probability that the remaining trees get it
wrong. $\binom{k}{j}$ simply counts the number of possible
ways k trees could divide into two sets
of trees, one of size j. Figure 1 shows how
maj(k,x) varies with p(x) when different
numbers of trees are used for the majority.
Note that for example x, taking the majority
vote increases the probability of getting
a correct classification if p(x) > 0.5, but decreases
it if p(x) < 0.5. Let $X_1$ be the set of
examples in the test set for which p(x) < 0.5,
and $X_2$ be those for which p(x) > 0.5. If
$x \in X_1$, it is to our advantage to use the
classifiers directly. If, on the other hand,
$x \in X_2$, taking the majority will increase the
probability that we will classify x correctly.
For any given test set, there will likely be
points in both cases. Obviously, we cannot
tell, given a particular example, whether it
belongs to $X_1$ or $X_2$ unless we know its classification.
However, it is our experience that
often the benefit we get by increasing the
likelihood of a correct classification for those
examples in $X_2$ outweighs the loss in accuracy
we get on the examples in $X_1$.
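
The majority probability is a binomial tail sum and is easy to evaluate numerically. The sketch below is a plain Python rendering of that sum, assuming the trees vote independently with the same success probability p for the example; the function name maj mirrors the notation of the text, and everything else is illustrative.

```python
from math import comb

def maj(k, p):
    """Probability that more than half of k independent trees are correct,
    when each tree classifies the example correctly with probability p."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

good = [maj(k, 0.8) for k in (1, 3, 9)]    # rises with k when p > 0.5
bad = [maj(k, 0.45) for k in (1, 3, 9)]    # falls with k when p < 0.5
```

Evaluating the function for a fixed p and growing k shows the two regimes discussed above: the values increase when p exceeds 0.5 and decrease when it is below 0.5.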

Contrary  to  intuition,  simply  increasing  the 
number  of  trees  participating  in  a  majority 
voting  scheme  does  not  necessarily  increase 
the  expected  accuracy  of  the  classifier. 

At  first  glance,  it  may  appear  that  the  more 
trees  that  are  used,  the  higher  will  be  the 
resulting  accuracy.  However,  this  is  not  nec¬ 
essarily  true.  An  implication  of  this  is  that 
choosing  the  appropriate  value  for  k  may  be 
a  difficult  problem. 

We have already seen that for some examples
(those with less than 50% probability of being
correctly classified by the average tree), using
a majority vote will lower the chances of
a correct classification, and the more trees
used, the lower the resulting accuracy will
be. On the other hand, increasing the number
of trees involved in the vote will increase
the accuracy on those points likely to be classified
correctly by the average tree. When
we try using a majority voting scheme on a
mixture of these two types of examples, we
will get a mixed result. Consider two examples,
$e_1$ and $e_2$. If we generate many trees,
on average $e_1$ is classified correctly 45% of
the time, and $e_2$ is classified correctly 80%
of the time. As shown in Figure 2, if we
use a majority voting scheme, then $e_1$ will
rarely be classified correctly, but $e_2$ will almost
always be classified correctly. Figure 2
also shows the combined expected accuracy
for the set $\{e_1, e_2\}$. If we generate a series
of trees and use each one to classify the two
examples, we expect their average accuracy
to be 62.5%. If we use majority voting, we
expect the accuracy to increase up to about
68% for nine trees. However, if we use more
than nine trees, the expected accuracy goes
down, eventually converging to 50%.

For a set of examples X, where p(x) is the
probability of example x being correctly classified
by an average tree, it is easy to show
that the average accuracy without voting is

$$\frac{1}{|X|} \sum_{x \in X} p(x)$$

while the accuracy when an infinite number
of trees are used in a majority computation is

$$\frac{|\{x \in X : p(x) > 0.5\}|}{|X|}$$

that is, the fraction of the examples which
are more than likely classified correctly by
the average tree. Between these two extremes,
the overall accuracy may have dips
and peaks.
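
The two-example case from the text (success probabilities 45% and 80%) can be worked through numerically. The following Python sketch assumes independent trees and uses the binomial tail sum for the majority probability; the names maj, expected_accuracy, and curve are illustrative.

```python
from math import comb

def maj(k, p):
    """Probability that more than half of k independent trees are correct."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def expected_accuracy(k, probs):
    """Average majority-vote accuracy over examples with the given p values."""
    return sum(maj(k, p) for p in probs) / len(probs)

probs = [0.45, 0.80]
# Odd numbers of trees only, so that no vote can end in a tie.
curve = {k: expected_accuracy(k, probs) for k in range(1, 202, 2)}
```

Under these assumptions the curve starts at 62.5% for a single tree, rises to roughly 68% around nine trees, and then declines toward the 50% limit, since only one of the two examples has a success probability above 0.5.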

In  this  paper,  we  try  majority  voting  using 
different  numbers  of  trees.  We  use  these  ex¬ 
periments  to  empirically  choose  a  value  for 
k  which  seems  to  work  well  in  practice. 

3  Related  Work 

k-DT is one of several different strategies for
combining multiple classifiers. There are two
common approaches to this problem. The
first approach can be thought of as multi-level
learning. A set of classifiers is trained.
Their outputs are fed to another learning
system, which learns an appropriate weighting
scheme on the first-level classifiers, in
the hopes of creating a more accurate classifier.
Depending on the implementation, the
two levels can be trained separately or simultaneously.
Wolpert’s [1992] stacked generalization
technique and the hybrid technique
developed by Zhang et al. [1992] are examples
of separately trained systems. An
example of a simultaneously trained system
is [Jacobs et al., 1991], in which the second
learning level learns how to assign training
examples to the different components of the
first level.

Figure 1: Majority classification probability vs. individual classification probability

Figure 2: Effects of majority voting on mixed data sets

k-DT takes another approach. Only the first
level is trained; the second level is a simple,
easily understood, fixed network. Another
system that shares this property is the cluster
back propagation network of Lincoln et
al. [1990].


4  The  SADT  Algorithm 

Although the majority voting technique
could be applied to any randomized classifier
scheme, k-DT was first conceived of as a
natural enhancement to our SADT algorithm.
Accordingly, all of our experiments have
been conducted on the SADT algorithm. To
aid in the understanding of k-DT, we explain
the workings of our SADT algorithm here.

The basic outline of the SADT algorithm is
the  same  as  that  of  most  other  decision  tree 
algorithms.  That  is,  we  find  a  hyperplane  to 
partition  the  training  set  and  recurse  on  the 
two  partitions.  Here  we  describe  the  search 
for  a  good  hyperplane. 

In our implementation, d-dimensional hyperplanes are stored
in the form $H(x) = h_{d+1} + \sum_{i=1}^{d} h_i x_i$, where
$H = \{h_1, h_2, \ldots, h_{d+1}\}$ is the hyperplane,
$x = (x_1, x_2, \ldots, x_d)$ is a point, and $h_{d+1}$ represents
the constant term. For example, in
the plane the hyperplane is a line and is represented
in the familiar ax + by + c = 0 form.
Classification is done recursively. To classify
an example, compare it to the current hyperplane
(initially this is the root node). If
an example p is at a non-leaf node labeled
H(x), then we follow the left child if
H(p) > 0; otherwise we descend to the right
child.
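
The classification procedure can be sketched as a short tree walk. The Python fragment below is an illustration under stated assumptions: the Node class, the representation of leaves as plain class labels, and the example tree are hypothetical, while the coefficient layout (d weights followed by the constant term) and the left-if-positive rule follow the text.

```python
from dataclasses import dataclass
from typing import Sequence, Union

@dataclass
class Node:
    h: Sequence[float]          # h_1..h_d followed by the constant term h_{d+1}
    left: Union["Node", str]    # followed when H(p) > 0
    right: Union["Node", str]   # followed otherwise

def H(h, x):
    """Evaluate the hyperplane: constant term plus weighted sum of coordinates."""
    return h[-1] + sum(hi * xi for hi, xi in zip(h, x))

def classify(tree, p):
    """Descend from the root until a leaf (a class label) is reached."""
    while isinstance(tree, Node):
        tree = tree.left if H(tree.h, p) > 0 else tree.right
    return tree

# A two-level oblique tree in the plane: first x + y - 1 > 0, then x > 0.
tree = Node(h=[1.0, 1.0, -1.0], left="above",
            right=Node(h=[1.0, 0.0, 0.0], left="right-below", right="left-below"))
```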

The first step in our algorithm is to generate
an initial hyperplane. The initial hyperplane
we generate is always the same and is
not tailored to the training set. We simply
wanted to choose some hyperplane that was
not parallel to any of the axes, so we used the
hyperplane passing through the points where
$x_i = 1$ and all other $x_j = 0$, for each dimension
i. In particular, the initial hyperplane
may be written in the above form as $h_i = 1$
for $1 \le i \le d$ and $h_{d+1} = -1$, since H(x) = 0
for each of these points. Thus in 3-D, we
choose the hyperplane which passes through
(1,0,0), (0,1,0), and (0,0,1). Many other
choices for the initial hyperplane would be
equally good. Once the annealing begins,
the hyperplane is immediately moved to a
new position, so the location of the initial
split is not important.

Next, the hyperplane is repeatedly perturbed.
If we denote the current hyperplane
by $H = \{h_1, h_2, \ldots, h_{d+1}\}$, then the
algorithm picks one of the $h_i$’s randomly
and adds to it a uniformly chosen random
variable in the range [-0.5, 0.5). Using
our goodness measure (described below), we
compute the energy of the new hyperplane
and the change in energy $\Delta E$.

If $\Delta E$ is negative, then the energy has decreased
and the new hyperplane becomes the
current split. Otherwise, the energy has increased
(or stayed the same) and the new
hyperplane becomes the current split with
probability $e^{-\Delta E / T}$, where T is the temperature
of the system. The system starts out
with a high temperature that is reduced
slightly with each move. Note that when the
change in energy is small relative to the temperature,
the probability of accepting the
new hyperplane is close to 1, but that as the
temperature becomes small, the probability
of moving to a worse state approaches 0.

In  order  to  decide  when  to  stop  perturbing 
the  split,  we  keep  track  of  the  split  that  gen¬ 
erated  the  lowest  energy  seen  so  far  at  the 
current  node.  If  this  minimum  energy  does 
not change for a large number of iterations
(we  used  numbers  between  3000  and  100,000 
iterations  in  our  experiments),  then  we  stop 
making  perturbations  and  use  the  split  that 
generated  the  lowest  energy.  The  recursive 
splitting  continues  until  each  node  is  pure; 
i.e.,  each  leaf  node  contains  only  points  of 
one  category. 
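
The search for a single split can be summarized in code. The following Python sketch is an illustration of the procedure described above, not the SADT implementation. Assumptions of the sketch: the energy is the sum-minority measure for two classes, the acceptance rule is the standard exponential criterion of simulated annealing, the temperature follows a geometric cooling schedule, and the parameter names t0, cooling, and patience are hypothetical.

```python
import math
import random

def H(h, x):
    """Evaluate the hyperplane: constant term h[-1] plus the weighted sum."""
    return h[-1] + sum(hi * xi for hi, xi in zip(h, x))

def sum_minority(h, points, labels):
    """Energy of a split: minority count on each side, summed (two classes)."""
    counts = {True: {}, False: {}}
    for x, y in zip(points, labels):
        side = counts[H(h, x) > 0]
        side[y] = side.get(y, 0) + 1
    classes = set(labels)
    return sum(min(side.get(c, 0) for c in classes) for side in counts.values())

def anneal_split(points, labels, t0=1.0, cooling=0.999, patience=3000, rng=None):
    """Perturb one coefficient at a time and return the best split seen."""
    rng = rng or random.Random()
    d = len(points[0])
    current = [1.0] * d + [-1.0]          # fixed initial hyperplane
    e_current = sum_minority(current, points, labels)
    best, e_best = list(current), e_current
    t, stale = t0, 0
    while stale < patience:               # stop when the best energy stagnates
        candidate = list(current)
        candidate[rng.randrange(d + 1)] += rng.random() - 0.5  # in [-0.5, 0.5)
        e_new = sum_minority(candidate, points, labels)
        delta = e_new - e_current
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, e_current = candidate, e_new
        if e_current < e_best:
            best, e_best, stale = list(current), e_current, 0
        else:
            stale += 1
        t = max(t * cooling, 1e-12)       # assumed schedule; floor avoids t == 0
    return best, e_best
```

Because the best split seen so far is kept separately from the current one, the returned energy can never be worse than that of the initial hyperplane, even though the walk itself may accept uphill moves.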

4.1  Goodness  Criteria 

SADT can work with any goodness criterion,
and we have experimented with several criteria.
For detailed discussions of these measures,
see Heath [1992]. In this paper, we experiment
with three of these criteria: Quinlan’s
[1986] Information Gain, and our own
Max Minority and Sum Minority. We define
max minority and sum minority as follows.

Consider a set of examples $X$, belonging to 2 classes, $u$ and
$v$. A hyperplane divides the set into two subsets $X_1$ and
$X_2$. For each subset, we find the class that appears least
often. We say that these are the minority categories. If $X_1$
has few examples in its minority category $C_1$, then it is
relatively pure. We prefer splits that are pure; i.e., splits
that generate small minorities. Let the number of examples in
class $u$ (class $v$) in $X_1$ be $u_1$ ($v_1$) and the number of
examples in class $u$ (class $v$) in $X_2$ be $u_2$ ($v_2$). To
force SADT to generate a relatively pure split, we define the
sum-minority error measure to be
$\min(u_1, v_1) + \min(u_2, v_2)$, and the max-minority error
measure to be $\max(\min(u_1, v_1), \min(u_2, v_2))$.
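
Both measures follow directly from the four class counts. The sketch below is illustrative only; the function names and the example counts are assumptions, not values from the experiments.

```python
def sum_minority(u1, v1, u2, v2):
    """Total number of minority-class examples on both sides of the split."""
    return min(u1, v1) + min(u2, v2)


def max_minority(u1, v1, u2, v2):
    """Size of the larger of the two minorities."""
    return max(min(u1, v1), min(u2, v2))


# Hypothetical split: X1 holds 8 u's and 1 v, X2 holds 2 u's and 9 v's.
# The minorities are 1 (in X1) and 2 (in X2).
example_sum = sum_minority(8, 1, 2, 9)   # 1 + 2
example_max = max_minority(8, 1, 2, 9)   # max(1, 2)

# A pure split has no minority examples on either side.
pure_sum = sum_minority(5, 0, 0, 5)
```

Lower values are better for both measures, so either can serve as the energy that the annealing search tries to reduce.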

5  Experiments 

5.1  Classifying  irises 

For our first experiment, we ran k-DT on a real dataset that has
been the subject of other machine learning studies. Fisher’s iris
data is a well known dataset (see Fisher [1936]), and many common
learning techniques have been applied to it. Weiss and Kapouleas
[1989] compared several learning algorithms on the iris data set,
as well as some others. The data consists of 150 examples, 50
each of three different types of irises: setosa, versicolor, and
virginica. Each example is described by numeric measurements of
width and length of the petals and sepals.

We performed thirty-five 10-fold cross-validation trials using
SADT. In an x-fold cross-validation trial, we divide the dataset
into x approximately equal sized subsets and perform x
experiments. For each set s, we train the learning system on the
union of the remaining x - 1 sets and test on set s. The results
are averaged over these x runs.
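
The x-fold procedure can be written generically. This is a minimal sketch: `train` and `predict` are hypothetical callables standing in for any learner, the shuffle seed is an assumption, and the code assumes there are at least x examples so that no fold is empty.

```python
import random


def cross_validation_folds(n_examples, x, seed=0):
    """Shuffle the indices and deal them into x nearly equal subsets."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::x] for i in range(x)]


def cross_validate(examples, labels, x, train, predict, seed=0):
    """Train on x - 1 subsets, test on the held-out one, and average."""
    accuracies = []
    for held_out in cross_validation_folds(len(examples), x, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, examples[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# With 150 examples and x = 10, every fold holds exactly 15 examples.
folds = cross_validation_folds(150, 10)
```

Repeating the whole procedure with different seeds gives the repeated trials described above, each with a different partition.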

SADT can use many different goodness criteria to guide its search
for good trees. We used three different criteria: our own max
minority and sum minority and Quinlan’s [1986] information gain.
Averaged results are shown in Table 1.

Also shown in the table is the accuracy obtained when, for each
training- and test-set pair, we take the majority vote of 11
trees when classifying the test set. Note that the accuracy, when
using the majority voting scheme, is consistently higher than
when using single SADT trees.
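
The voting step itself is small. In this sketch each tree is represented by a hypothetical callable that maps an example to a class label; how ties are broken is an assumption, since the text does not say (here the label that was voted for first wins).

```python
from collections import Counter


def majority_vote(classifiers, example):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(example) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three stand-in "trees": two vote versicolor, one votes virginica.
trees = [
    lambda x: "versicolor",
    lambda x: "virginica",
    lambda x: "versicolor",
]
label = majority_vote(trees, example=None)
```

With two classes and an odd number of trees a tie cannot occur; with three classes, as in the iris data, it can.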
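Goodness    Average error  Error rate with  Reduction    Best         Best number
criterion   rate (%)       11 trees         in error     error rate   of trees

MM          [illegible]    [illegible]      [illegible]  [illegible]  9
SM          [illegible]    [illegible]      30%          [illegible]  33
IG          [illegible]    [illegible]      13%          [illegible]  5

Table 1: Iris results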


Weiss and Kapouleas [1989] obtained accuracies of 96.7%, 96.0%,
and 95.3% with backpropagation, nearest neighbor, and CART,
respectively. Their results were generated with leave-one-out
trials, i.e., 150-fold cross validation.

5.1.1  Choosing  a  value  for  k 

How did we choose k = 11 for our k-DT trees? Intuitively, it may
seem that the more trees used in the voting process, the higher
the combined accuracy will be. However, if an example is somehow
“difficult” to classify, then voting will only make it less
likely that the example is classified correctly.

Figure 3 is a plot of average classification accuracy on the iris
data set, as the number of trees in the voting process is varied.
Note that there is a big jump in accuracy even when only three
trees are used. The max minority and information gain measures
peak fairly early and begin to drop off, whereas the sum minority
measure is still increasing in accuracy at thirty-five trees.

We have compromised by using eleven trees, which appears to work
well in practice. Table 1 shows the average classification
accuracy when using eleven trees for voting. Also shown is the
classification accuracy for the optimal choice of k. (The optimal
choice in the table is limited by the number of cross-validation
trials we have run, since we only had that many trees to work
with.) The choice of 11 trees worked well for the iris dataset.
The accuracy obtained with this number of trees was at least as
good as any other number of trees we tried for two of the energy
measures and still quite good for the third.

At this point, it is worth considering whether these results are
to be expected. For each example x in the iris data set, we
computed the percentage p(x) of times it was correctly classified
in our tests. Figure 4 shows, for a given percentage p, the
fraction of the examples for which p(x) = p. (Note that the
figure is an average over all three goodness criteria.) This
gives us a rough estimate of the probability of the average tree
classifying that example correctly. First, note that a vast
majority of the examples are always or nearly always classified
correctly. Approximately 4.4% of the examples are predicted
correctly less than half of the time. These are the examples that
we would expect to be classified incorrectly if we were to take a
majority vote over a large number of trees. We note that this
percentage is close to the error rate obtained with k-DT.
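
The argument can be checked numerically under two simplifying assumptions that the paper does not state: each tree classifies a given example correctly with the same probability p, independently of the other trees, and the vote is correct only when more than half of the trees are correct. The function name is hypothetical.

```python
from math import comb


def majority_correct_probability(p, k):
    """Chance that more than half of k independent votes are correct."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))


# An "easy" example (p = 0.9) is almost certain to be voted correctly by
# 11 trees, while a "difficult" one (p = 0.4) does worse than a single tree.
easy = majority_correct_probability(0.9, 11)
hard = majority_correct_probability(0.4, 11)
```

Under these assumptions voting pushes examples with p above one half toward certainty and examples with p below one half toward certain error, which is why the fraction of examples with p(x) below 50% approximates the error rate of a large vote.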

5.2  Applying k-DT to cancer diagnosis

For our second experiment, we chose a dataset that has been the
subject of experiments that classified the data using oblique
hyperplanes (Mangasarian et al., 1990). This dataset contains 470
examples of patients with breast cancer, and the diagnostic task
is to determine whether the cancer is benign or malignant. The
input data comprised 9 numeric attributes, hence our decision
trees used oblique hyperplanes in 9-D.

Figure 3: Iris classification accuracy vs. number of trees

Figure 4: Percentage of iris examples achieving given accuracy

Mangasarian’s method uses linear programming to find pairs of
hyperplanes that partition the data. The algorithm finds one pair
of parallel hyperplanes at a time, and each pair can be oriented
at any angle with respect to all other pairs. The resulting model
is a set of oblique hyperplanes, similar though not identical to
an oblique decision tree.

Because Mangasarian et al. received the data as they were
collected in a clinical setting, their experimental design was
very simple. They trained their algorithm on the initial set of
369 examples. Of the 369 patients, 201 (54.5%) had no malignancy
and the remainder had confirmed malignancies. On the next 70
patients to enter the clinic, they used their algorithm for
diagnosis, and found that it correctly diagnosed 68 patients. We
used 68/70 = 0.97 as a rough estimate of the accuracy of
Mangasarian et al.’s method. They then re-trained their algorithm
using the 70 new patients, and reported that it correctly
classified all of the next 31 patients to enter the clinic.
Mangasarian reported that his program’s output was being used in
an actual clinical setting. Using the same dataset with a more
uniform experimental design, Salzberg reported that the EACH
hyper-rectangle program produced 95% classification accuracy, and
1-nearest-neighbor had 94% accuracy (Salzberg, 1991).

The results of our tests on this data are shown in Table 2. The
average values are the average of thirty-six 10-fold
cross-validation trials. Once again, the accuracy obtained by
using an 11-tree majority classifier is consistently higher than
that of the average tree. In this example, the sum minority
goodness criterion did quite a bit better on average than the
other two, but it benefited less from the use of the majority
technique. It is possible that by taking the majority, we are
able to overcome weaknesses in the other two criteria that are
not as significant with sum minority.

We see that using eleven trees is a good choice for this dataset
as well. Only for the max minority energy measure was there a
noticeable difference in accuracy between the optimal choice for
the number of trees and our choice of 11.

5.3  Identifying  stars  and  galaxies 

In order to study the performance of k-DT on larger datasets, we
ran several experiments using astronomical image data collected
with the University of Minnesota Plate Scanner. This dataset
contains several thousand astronomical objects, all of which are
classified as either stars or galaxies. Odewahn et al. [1992]
used this dataset to train perceptrons and backpropagation
networks to differentiate between stars and galaxies.

We did not have access to the exact training and test set
partitions used by Odewahn et al., so we used a cross-validation
technique to estimate classification accuracy. The Odewahn et al.
study used a single training/test set partition. Although our
results may not be completely comparable to theirs, we include
them to show that both learning methods produce similar
accuracies. Our results were generated by averaging nineteen
10-fold cross-validation trials.
Goodness    Average error  Error rate with  Reduction    Best         Best number
criterion   rate (%)       11 trees         in error     error rate   of trees

MM          7.3            [illegible]      34%          4.4          33
SM          5.1            [illegible]      13%          4.3          23
IG          6.7            [illegible]      27%          4.9          11

Table 2: Breast cancer malignancy results


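Goodness     Average error  Error rate with  Reduction    Best         Best number
criterion    rate (%)       11 trees         in error     error rate   of trees

MM           [illegible]    [illegible]      58%          [illegible]  11
SM           [illegible]    [illegible]      [illegible]  [illegible]  [illegible]
[illegible]  1.0            0.6              40%          0.6          [illegible]

Table 3: Star/galaxy results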


The astronomy dataset consists of 4164 examples. Each example has
fourteen real-valued attributes and a label of either “star” or
“galaxy.” Approximately 35% of the examples are galaxies.

Classification results are shown in Table 3. Odewahn et al.
[1992] obtained accuracies of 99.7% using backpropagation and
99.4% with a perceptron. It appears, however, that their results
were generated with a single trial on a single partition into
test and training sets. In fact, we obtained a ten-fold
cross-validated accuracy of 99.1% using a perceptron.

Using a majority classifier increased classification accuracy for
this dataset, as in the other studies. For the max minority
goodness criterion, we were able to reduce the error rate by
almost 60%. Using eleven trees for the majority classification
was a good choice for this dataset. The results for eleven trees
were at least as good as for any other number of trees (up to 15,
the number of cross-validation trials we ran).
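
The "Reduction in Error" column in the tables appears to be the relative drop from the average single-tree error rate to the 11-tree error rate. That reading is an assumption, but it agrees with the one row of Table 3 whose entries are all legible (1.0 and 0.6, giving 40%). The function name is hypothetical.

```python
def error_reduction(average_error, voted_error):
    """Relative reduction in error rate achieved by voting."""
    return (average_error - voted_error) / average_error


# Legible row of Table 3: average error 1.0%, error with 11 trees 0.6%.
reduction_percent = round(100 * error_reduction(1.0, 0.6))
```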


5.4  Comparison  with  other  methods 

In [Heath, 1992], we explored several techniques of taking
advantage of randomization in learning algorithms. Our focus in
that work was on techniques that generate many trees, and use
some additional criteria to select the best tree, which we then
measure on the testing set. In this section, we compare those
techniques to the majority classification technique.

One of our criteria for choosing the “best” tree was to choose
the smallest trees. The intuition behind this technique is that
smaller trees may be more concise descriptions of the problem
domain, less sensitive to noise in the training data, and have a
lower chance of being generated through overtraining. For each of
the ten pairs of training and testing sets in a 10-fold
cross-validation, we generated several SADT trees, and then chose
the smallest. We then averaged the accuracy and size of the ten
chosen trees. If, for given training and testing sets, there was
more than one smallest tree, we averaged them, before averaging
them with the other nine.
In another experiment in [Heath, 1992] we split the training set
70/30 and trained only using 70% of the training set. The other
30% was used as a second test set. We used it to test the tree
and assign it a figure of merit. We ran this several times,
choosing different 70/30 splits each time and choosing the trees
with the highest figures of merit. We then tested those trees on
the real test set.
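
The 70/30 selection procedure can be sketched as follows. This is an illustrative outline only: `train` and `accuracy` are hypothetical callables, the number of trials is an assumption, and the sketch keeps a single best model, whereas the text describes choosing the trees with the highest figures of merit.

```python
import random


def pick_by_holdout(examples, labels, train, accuracy, n_trials=5, seed=0):
    """Train on 70% of the training set; score each model on the other 30%."""
    rng = random.Random(seed)
    n = len(examples)
    cut = (7 * n) // 10
    best_model, best_merit = None, -1.0
    for _ in range(n_trials):
        order = list(range(n))
        rng.shuffle(order)            # a different 70/30 split each time
        fit_idx, merit_idx = order[:cut], order[cut:]
        model = train([examples[i] for i in fit_idx],
                      [labels[i] for i in fit_idx], rng)
        merit = accuracy(model,
                         [examples[i] for i in merit_idx],
                         [labels[i] for i in merit_idx])
        if merit > best_merit:
            best_model, best_merit = model, merit
    return best_model, best_merit
```

The chosen model would then be evaluated once on the real test set, which plays no part in the selection.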

In Table 4, we compare k-DT with these two approaches. All three
techniques gave some improvement in accuracy, although the method
of choosing trees by size was not very consistent. In some cases,
small trees were actually worse than average trees. k-DTs always
performed better than using a separate test set to judge trees.
It nearly always performed better than picking the smallest
trees. The only exception to this was for two goodness criteria
used on the iris data set. The disadvantage to k-DTs, of course,
is that they are not trees, but rather collections of trees. Thus
the representation created by k-DT is not as compact as with a
single tree.


6  Summary  of  Results 

We have explored the idea of using the randomization inherent in
some learning techniques to advantage, by generating a number of
classifiers and combining them with a majority voting scheme. We
have experimented with this technique on SADT, a randomized
oblique decision tree algorithm.

Our results show that majority classifiers for SADT consistently
perform better than average SADT trees, which in turn perform
better than standard axis-parallel trees (see Heath, 1992). The
consistency and degree of improvement are better than those of
other techniques we have considered for increasing accuracy
through randomization.

This work is preliminary; we have not tried to apply the majority
technique to other types of randomized learning algorithms.
However, this is a clear opportunity for future experiments. We
would also like to explore combining this technique with other
techniques. For example, we would like to try the majority
technique on trees which are smaller than average, to see if we
can get any further improvements in accuracy. It is possible that
for some applications, the added complexity of a majority
classifier can be a disadvantage. We are exploring ways that
might allow us to combine several trees in a majority-like way,
yet still end up with a small tree structure.
Abstract

Meta-learning is proposed as a general technique to combine a
number of separate classifiers for machine learning tasks. A
number of possible approaches for meta-learning are proposed
including the use of a single learning algorithm as well as a
number of different learning algorithms. This paper describes
meta-learning strategies for combining independently learned
classifiers by different algorithms to improve overall accuracy.
We also present several meta-learning strategies for combining
learned classifiers by a number of parallel learning processes on
subsets of training data using different algorithms. The
strategies are independent of the learning algorithms used.

Keywords: concept learning, meta-learning,
and parallel and distributed processing.

1  Introduction 

Most research in concept learning (or learning from examples)
focuses on the conception and evaluation of distinct learning
strategies embodied by an individual algorithm. Since different
algorithms have different representations and search heuristics,
different search spaces are being explored and hence potentially
diverse results can be obtained from different algorithms.
Mitchell (1980) refers to this phenomenon as inductive bias. That
is, the outcome of running an algorithm is biased in a certain
direction. Furthermore, different data sets have different
characteristics and the performance of different algorithms on
these data sets might differ. In other words, to date there is no
single algorithm that works best on all kinds of data sets.
Hence, it is beneficial to build a framework that allows
different learning algorithms to be used in diverse situations.

Recently, many researchers have proposed implementing learning
systems by integrating in some fashion a number of different
strategies and algorithms to boost overall accuracy. The basic
notion behind this integration is to complement the different
underlying learning strategies embodied by different learning
algorithms by effectively reducing the space of incorrect
classifications of a learned concept. There are many ways of
integrating different learning algorithms. For example, work on
integrating concept and explanation-based learning (Flann &
Dietterich, 1989; Towell et al., 1990) requires a complicated new
algorithm that implements both approaches to learning in a single
system. Another line of work focuses on combining different
learning systems in a loose fashion. Examples include Silver et
al.’s (1990) work on using a coordinator to gather votes from
three different learners and Holder’s (1991) work on selecting
learning strategies based on their relative utility. One
advantage of this approach is its simplicity in treating the
individual learning systems as black boxes with little or no
modification required to achieve a final system. Therefore,
individual systems can be added or replaced with relative ease.

A more interesting approach to loosely combine learners is to
learn how to combine independently learned concepts. Stolfo et
al.’s work (1989) attempts to learn rules for merging different
phoneme output representations from multiple trained speech
recognizers. Wolpert (1992) presents a theory on stacked
generalization (meta-learning). Several (level 0) classifiers are
first learned from the same training set. The predictions made by
these classifiers on the training set and the correct
classifications form the training set of the next level (level 1)
classifier. When an instance is being classified, the level 0
classifiers first make their predictions on the instance. The
predictions are then presented to the level 1 classifier, which
makes the final prediction. Zhang et al.’s (1992) work utilizes a
similar approach to learn a combiner based on the predictions
made by three different classifiers.
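
The two-level scheme described above can be sketched in a few lines. The level 0 classifiers are hypothetical callables, and the level 1 learner used here (a lookup table from prediction tuples to the most frequent correct class) is a deliberately simple stand-in chosen for illustration, not the learner used in any of the cited work.

```python
from collections import Counter, defaultdict


def build_level1_training_set(level0_classifiers, examples, labels):
    """Pair each example's level 0 predictions with its correct class."""
    return [(tuple(clf(x) for clf in level0_classifiers), y)
            for x, y in zip(examples, labels)]


def train_lookup_level1(rows):
    """Toy level 1 learner: most common correct class per prediction tuple."""
    table = defaultdict(Counter)
    for predictions, label in rows:
        table[predictions][label] += 1
    fallback = Counter(label for _, label in rows).most_common(1)[0][0]

    def classify(predictions):
        if predictions in table:
            return table[predictions].most_common(1)[0][0]
        return fallback  # prediction pattern never seen during training
    return classify


def stacked_predict(level0_classifiers, level1_classifier, example):
    """Level 0 predicts first; level 1 makes the final prediction."""
    predictions = tuple(clf(example) for clf in level0_classifiers)
    return level1_classifier(predictions)
```

In a toy case where the true concept is "greater than 3 but not greater than 6", no single threshold classifier is correct everywhere, yet a level 1 classifier that sees the outputs of the "greater than 3" and "greater than 6" classifiers together can recover the concept exactly.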

Furthermore, much of the research in concept learning
concentrates on problems with relatively small amounts of data.
With the coming age of “high-capacity” and “light-speed”
networks, it is likely that orders of magnitude more data will be
available for various learning problems of real world importance.
The Grand Challenges of HPCC (Wah et al., 1993) are perhaps the
best examples.

Quinlan (1979) approached the problem of efficiently applying
learning systems to data that are substantially larger than
available memory with a windowing technique. A learning algorithm
is applied to a small subset of training data, called a window,
and the learned concept is tested on the remaining training data.
This is repeated on a new window of the same size with some of
the incorrectly classified data replacing some of the data in the
old window until all the data are correctly classified. Wirth and
Catlett (1988) show that the windowing technique does not
significantly improve speed on reliable data. On the contrary,
for noisy data, windowing considerably slows down the
computation. Catlett (1991) demonstrates that larger amounts of
data improve accuracy, but he projects that ID3 (Quinlan, 1986)
on modern machines will take several months to learn from a
million records in the flight data set obtained from NASA. He
proposes some improvements to the ID3 algorithm, but his scheme
is limited to real-numbered attributes and the complexity is
still prohibitive for large amounts of data (Chan & Stolfo,
1993c). Typical learning systems like ID3 cannot handle data that
exceed the size of a monolithic memory on a single processor. We
believe parallel and distributed processing with
divide-and-conquer techniques provides the best hope of dealing
with such large amounts of data in terms of speed and memory
requirement. But how precisely does one organize and implement a
parallel processing system for machine learning tasks?
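
The windowing loop described at the start of the preceding paragraph can be sketched as follows. This is an outline under stated assumptions rather than Quinlan's implementation: `train` and `predict` are hypothetical callables, the initial window is drawn at random, at most half of the window is replaced per round, and a round limit guards against non-termination on noisy data.

```python
import random


def windowing(examples, labels, train, predict, window_size,
              max_rounds=50, seed=0):
    """Learn from a fixed-size window, swapping in misclassified examples."""
    rng = random.Random(seed)
    n = len(examples)
    window = rng.sample(range(n), window_size)
    model = None
    for _ in range(max_rounds):
        model = train([examples[i] for i in window],
                      [labels[i] for i in window])
        in_window = set(window)
        # Test the learned concept on the remaining training data.
        wrong = [i for i in range(n)
                 if i not in in_window
                 and predict(model, examples[i]) != labels[i]]
        if not wrong:
            break
        # Replace part of the old window with misclassified examples.
        swap = min(len(wrong), max(1, window_size // 2))
        rng.shuffle(window)
        window = window[swap:] + rng.sample(wrong, swap)
    return model
```

Each round retrains from scratch on the new window, which is consistent with the observation above that windowing can slow learning down considerably on noisy data.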

One approach to this problem is to parallelize the learning
algorithms. Zhang et al.’s (1989) work on parallelizing the
backpropagation algorithm on a Connection Machine is one example.
This approach requires optimizing the code of a particular
algorithm for a specific architecture. Our approach is to run the
serial code on a number of data subsets in parallel and combine
the results in an intelligent fashion. This approach has the
advantage of using the same serial code without the
time-consuming process of parallelizing it. In addition, our
proposed framework for combining the results of learned concepts
is independent of the learning algorithms and the computing
platform used. However, this approach cannot guarantee the
accuracy of the learned concepts to be the same as the serial
version since a considerable amount of information may not be
accessible to each of the learners.
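
The idea of running unchanged serial code on data subsets can be sketched with the standard library. Several things here are assumptions made for illustration: the `train` and `predict` callables are hypothetical, the data are dealt into subsets round-robin, a thread pool stands in for separate processors (in CPython, separate processes would be needed for real CPU parallelism), and a plain majority vote stands in for the learned combiners that this paper goes on to describe.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def learn_on_subsets(examples, labels, train, n_subsets):
    """Apply the same serial learner to each data subset concurrently."""
    parts = [(examples[i::n_subsets], labels[i::n_subsets])
             for i in range(n_subsets)]
    with ThreadPoolExecutor(max_workers=n_subsets) as pool:
        return list(pool.map(lambda part: train(*part), parts))


def vote(models, predict, example):
    """Combine the partial results with a simple majority vote."""
    return Counter(predict(m, example) for m in models).most_common(1)[0][0]
```

Because each learner sees only its own subset, an individual model can be less accurate than one trained on all the data; the combining step is what tries to recover the lost accuracy.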

In this paper we present the concept of meta-learning, introduced
in (Chan & Stolfo, 1993c), and its use in coalescing the results
from multiple concept learning systems to improve accuracy and
the results achieved from a set of parallel or distributed
learning processes to improve learning speed. The ultimate goal
of this work is to improve both the accuracy and efficiency of
machine learning by means of parallel processing of multiple
learning systems applied to massive amounts of training data.
There are many ways one might imagine to combine learned
classifiers. For this paper, we detail only a few approaches.
Thus, this work may be viewed as exploratory to determine the
efficacy of the general approach. Section 2 discusses
meta-learning and how it facilitates multistrategy and parallel
learning. Section 3 details our strategies for boosting accuracy
by meta-learning, which appear in (Chan & Stolfo, 1993a). Section
4 discusses our approach to improve speed and accuracy through
meta-learning. Section 5 concludes with our findings and work in
progress.

2  Meta-learning 

Meta-learning can be loosely defined as learning from information
generated by a learner(s). It can also be viewed as the learning
of meta-knowledge on the learned information. In our work we
concentrate on learning from the output of concept learning
systems. In this case meta-learning means learning from the
classifiers produced by the learners and the predictions of these
classifiers on training data. A classifier (or concept) is the
output of a concept learning system and a prediction (or
classification) is the predicted class generated by a classifier
when an instance is supplied. That is, we are interested in the
output of the learners, not the learners themselves. Moreover,
the training data presented to the learners initially are also
available to the meta-learner if warranted.

In essence we use multiple strategies to improve accuracy and
parallelism to improve speed. The use of meta-learning can
facilitate reaching these goals. This is demonstrated by four
frameworks supported by meta-learning as follows:

1. Hypothesis boosting (HB) involves the improvement of accuracy
   of a learning algorithm by meta-learning. A number of
   instances of a single learning algorithm are applied to
   distinguished subsets of training data that are composed in
   such a way as to improve the overall prediction accuracy
   (Schapire (1990) calls this hypothesis boosting). Based on an
   initial learned hypothesis for some concept derived from a
   random distribution of training data, Schapire (1990)
   generates two additional distributions of examples, to which
   the learning algorithm is then applied. The three different
   distributions are interrelated and generated successively. The
   predictions of the three learned classifiers are combined
   using a simple arbitration rule. Although his approach is
   limited to the PAC model of learning, some success was
   achieved in the domain of character recognition, using neural
   networks (Drucker et al., 1993). Freund (1990) has a similar
   approach, but with potentially many more distributions. This
   framework focuses primarily on improving the accuracy of an
   individual learner.

2. Parallel learning (PL) involves applying a single algorithm on
   different subsets of the data in parallel and the use of
   meta-learning to combine the partial results. Unlike
   hypothesis boosting, the subsets are independent and can be
   generated concurrently. This approach attempts to improve
   speed, not accuracy, via parallelism. Not much work by others
   has been done in applying meta-learning to parallel learning.

3. Multistrategy hypothesis boosting (MSHB) involves applying
   multiple algorithms on the same set of data and the results of
   the learned concepts are combined by meta-learning to improve
   overall accuracy. The aforementioned approaches used by
   Wolpert (1992) and Zhang et al. (1992) are examples of this
   strategy. This approach attempts to take advantage of the
   diversity of learners to increase accuracy.

4. Multistrategy parallel learning (MSPL) is a combination of
   parallel learning and multistrategy hypothesis boosting.
   Multiple learning algorithms are applied to subsets of the
   data in parallel. This framework tries to improve both speed
   via parallelism and accuracy via diversity. To our knowledge,
   not much research by others has been attempted in this
   framework, other than the proposed work in (Stolfo et al.,
   1989).

Our work has concentrated on parallel learning and multistrategy
hypothesis boosting, which, we believe, will provide some
insights in how to achieve multistrategy parallel learning. In
the rest of this paper we will discuss our approach to the
multistrategy hypothesis boosting and multistrategy parallel
learning frameworks.


3  Multistrategy Hypothesis Boosting

The objective here is to improve prediction accuracy by exploring
the diversity of multiple learning algorithms through
meta-learning. This is achieved by a basic configuration which
has several different base learners and one meta-learner that
learns from the output of the base learners. The meta-learner may
employ the same algorithm as one of the base learners or a
completely distinct algorithm. Each of the base learners is
provided with the entire training set of raw data. However, the
training set for the meta-learner varies according to the
strategies described below, and is quite different than the data
used in training the base classifiers. Note that the meta-learner
does not aim at picking the “best” classifier generated by the
base learners; instead it tries to combine the classifiers. That
is, the prediction accuracy of the overall system is not limited
to the most accurate base learner. It is our intent to generate
an overall system that outperforms the underlying base learners.

There are in general two types of information the meta-learner
can combine: the learned base classifiers and the predictions of
the learned base classifiers. The first type of information
consists of concept descriptions in the base classifiers (or
concepts). Some common concept descriptions are represented in
the form of decision trees, rules, and networks. Since we are
aiming at diversity in the base learners, the learning algorithms
chosen usually have different representations for their learned
classifiers. Hence, in order to combine the classifiers, we need
to define a common representation to which the different learned
classifiers are translated. However, it is difficult to define
such a representation to encapsulate all the representations
without losing a significant amount of information during the
translation process. For instance, it is very difficult to define
a common representation to integrate the discriminant functions
and exemplars computed by a nearest-neighbor learning algorithm
with the tree computed by a decision tree learning algorithm.
Because of this difficulty, one might define a uniform
representation that limits the types of representation that can
be supported and hence the choice of learning algorithms.

An  alternative  strategy  is  to  integrate  the  predic¬ 
tions  of  the  learned  classifiers  for  the  training  set 
leaving  the  internal  organization  of  each  classi¬ 
fier  completely  transparent.  These  predictions 
are  hypothesized  classes  present  in  the  training 
data  and  can  be  categorical  or  associated  with 
some  numeric  measure  (e.g.,  probabilities,  con¬ 
fidence  values,  and  distances).  In  this  case,  the 
problem  of  finding  a  common  ground  is  much 
less  severe.  For  instance,  classes  with  numeric 
measures  can  be  treated  as  categorical  (by  pick¬ 
ing  the  class  with  the  highest  value).  Since  any 
learner  can  be  employed  in  this  case,  we  focus 
our  work  on  combining  predictions  from  the 
learned  classifiers.  Moreover,  since  convert¬ 
ing  categorical  predictions  to  ones  with  numeric 
measures  is  undesirable  or  impossible,  we  con¬ 
centrate  on  combining  categorical  predictions. 

We experimented with three types of meta-learning
strategies (combiner, arbiter, and hybrid)
for combining predictions, which we discuss
in the following sections. For pedagogical
reasons, our discussion assumes three base
learners and one meta-learner.

3.1  Combiner  strategy 

In  the  combiner  strategy,  the  predictions  for  the 
training  set  generated  by  a  two-fold  cross  val¬ 
idation  technique  using  the  base  learners  form 
the  basis  of  the  meta-learner’s  training  set  (de¬ 
tails  in  (Chan  &  Stolfo,  1993a)).  A  composition 
rule,  which  varies  in  different  schemes,  deter¬ 
mines  the  content  of  training  examples  for  the 
meta-learner. From these examples, the meta-learner
generates a meta-classifier, which we call
a  combiner.  In  classifying  an  instance,  the  base 
classifiers  first  generate  their  predictions.  Based 
on  the  same  composition  rule,  a  new  instance 
is  generated  from  the  predictions,  which  is  then 
classified  by  the  combiner.  The  aim  of  this  strat¬ 
egy  is  to  coalesce  the  predictions  from  the  base 
classifiers  by  learning  the  relationship  between 
these  predictions  and  the  correct  prediction. 

We experimented with three schemes for the
composition rule. First, three predictions,
$C_1(x)$, $C_2(x)$, and $C_3(x)$, for each example $x$
in the original training set of examples, $E$, are
generated by three separate classifiers, $C_1$, $C_2$,
and $C_3$. These predicted classifications are used
to form a new set of “meta-level training instances,”
$T$, which is used as input to a learning
algorithm that computes a combiner. The manner
in which $T$ is computed varies according to
the schemes as defined below. In the following
definitions, $\mathit{class}(x)$ denotes the correct classification
of example $x$ as specified in training set
$E$.

1. Return meta-level training instances with
the correct classification and the predictions;
i.e., $T = \{(\mathit{class}(x), C_1(x), C_2(x), C_3(x)) \mid x \in E\}$.
This scheme was also
used by Wolpert (1992). (For further reference,
this scheme is denoted as meta-class.)
A sample training set is depicted in
Figure 1.

2. Return meta-level training instances similar
to those in the first (meta-class)
scheme with the addition of the original
attribute vectors in the training examples;
i.e., $T = \{(\mathit{class}(x), C_1(x), C_2(x), C_3(x), \mathit{attrvec}(x)) \mid x \in E\}$.
(Henceforth,
this scheme is denoted as meta-class-attribute.)

3. Return meta-level training instances similar
to those in the meta-class scheme
except that each prediction, $C_i(x)$, has
$m$ binary predictions, $C_{i_1}(x), \ldots, C_{i_m}(x)$,
where $m$ is the number of classes. Each
prediction, $C_{i_j}(x)$, is produced from a binary
classifier, which is trained on examples
that are labeled with classes $j$ and
$\neg j$. In other words, we are using more
specialized base classifiers and attempting
to learn the correlation between the
binary predictions and the correct prediction.
For concreteness,
$T = \{(\mathit{class}(x), C_{1_1}(x), \ldots, C_{1_m}(x), C_{2_1}(x), \ldots, C_{2_m}(x), C_{3_1}(x), \ldots, C_{3_m}(x)) \mid x \in E\}$.
(Henceforth, this scheme is denoted as meta-class-binary.)

These  three  schemes  for  the  composition  rule 
are  defined  in  the  context  of  forming  a  train¬ 
ing  set  for  the  combiner.  These  composition 
rules  are  also  used  in  a  similar  manner  during 
classification  after  a  combiner  has  been  com¬ 
puted.  Given  a  test  instance  whose  classifica¬ 
tion  is  sought,  we  first  compute  the  classifica¬ 
tions  predicted  by  each  of  the  base  classifiers. 
The  composition  rule  is  then  applied  to  generate 
a  single  meta-level  test  instance,  which  is  then 
classified  by  the  combiner  to  produce  the  final 
predicted  class  of  the  original  test  instance. 
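
The meta-class and meta-class-attribute composition rules described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`compose`, `combiner_training_set`, `classify_with_combiner`) are hypothetical, classifiers are assumed to be plain callables from an attribute vector to a class label, and the two-fold cross-validation step used to obtain the base predictions is omitted.

```python
def compose(attrvec, predictions, scheme="meta-class"):
    """Apply a composition rule to one example (hypothetical helper).

    Returns the attribute part of a meta-level instance.
    """
    if scheme == "meta-class":
        return tuple(predictions)
    if scheme == "meta-class-attribute":
        # Predictions followed by the original attribute vector.
        return tuple(predictions) + tuple(attrvec)
    raise ValueError(f"unknown scheme: {scheme}")


def combiner_training_set(E, classifiers, scheme="meta-class"):
    """Build T from E, a list of (attribute vector, correct class) pairs."""
    return [
        (label, compose(x, [c(x) for c in classifiers], scheme))
        for x, label in E
    ]


def classify_with_combiner(x, classifiers, combiner, scheme="meta-class"):
    """Classify a test instance: base predictions, same rule, then combiner."""
    return combiner(compose(x, [c(x) for c in classifiers], scheme))
```

The same `compose` call is used for training and for classification, which mirrors the requirement that the composition rule be applied identically in both phases.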

3.2 Arbiter strategy

In the arbiter strategy, the training set for the
meta-learner is a subset of the training set for the
base learners. That is, the meta-level training
instances are a particular distribution of the raw
training set E. The predictions of the learned
base classifiers for the training set and a selection
rule, which varies in different schemes, determine
which subset will constitute the meta-learner’s
training set. (This contrasts with the
combiner strategy, which has the same number
of examples for the base classifiers as for the
combiner. Also, the meta-level instances of the
combiner strategy incorporate more information
than just the raw training data.) Based
on this training set, the meta-learner generates a
meta-classifier, in this case called an arbiter. In
classifying  an  instance,  the  base  classifiers  first 
generate  their  predictions.  These  predictions, 
together  with  the  arbiter  and  a  corresponding 
arbitration  rule,  generate  the  final  predictions. 
(This  contrasts  with  the  multi-level  arbiter  trees 
discussed  in  Section  4.1.)  In  this  strategy  one 
learns  to  arbitrate  among  the  potentially  dif¬ 
ferent  predictions  from  the  base  classifiers,  in¬ 
stead  of  learning  to  coalesce  the  predictions  as 
in  the  combiner  strategy.  We  first  describe  the 
schemes  for  the  selection  rule  and  then  those  for 
the  arbitration  rule. 

We experimented with two schemes for the selection
rule, which chooses training examples
for an arbiter. In essence the schemes select examples
that are confusing to the three base classifiers,
from which an arbiter is learned. Based
on three predictions, $C_1(x)$, $C_2(x)$, and $C_3(x)$,
for each example $x$ in a set of training examples,
$E$, each scheme generates a set of training
examples, $T$ ($\subseteq E$), for the arbiter. The two
versions of this selection rule implemented and
reported here are:

1. Return instances with predictions that disagree;
i.e., $T = T_d = \{x \in E \mid (C_1(x) \neq C_2(x)) \lor (C_2(x) \neq C_3(x))\}$.
Thus, instances
with conflicting predictions are
used to train the arbiter. However, instances
with predictions that agree but are
incorrect are not included. (We refer to
this scheme as meta-different.) A sample
training set is depicted in Figure 1.

2. Return instances with predictions that disagree,
$T_d$, as in the first case (meta-different),
but also instances with predictions
that agree but are incorrect; i.e., $T = T_d \cup T_i$,
where $T_i = \{x \in E \mid (C_1(x) = C_2(x) = C_3(x)) \land (\mathit{class}(x) \neq C_1(x))\}$.
Note that we compose both cases of instances
that are incorrectly classified or are
in disagreement. (Henceforth, we refer to
this case as meta-different-incorrect.)
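
As an illustration of the two selection rules, the sketch below filters a training set into $T_d$ and $T_d \cup T_i$. It is a minimal sketch under stated assumptions: the function names are hypothetical, classifiers are callables returning hashable class labels, and E is a list of (attribute vector, correct class) pairs.

```python
def meta_different(E, classifiers):
    """T_d: examples on which the base classifiers do not all agree."""
    return [
        (x, label)
        for x, label in E
        if len({c(x) for c in classifiers}) > 1
    ]


def meta_different_incorrect(E, classifiers):
    """T_d union T_i: disagreements plus unanimous but incorrect predictions."""
    selected = []
    for x, label in E:
        predictions = {c(x) for c in classifiers}
        disagree = len(predictions) > 1
        # If all agree, the set holds one label; it is wrong if it differs
        # from the correct class.
        unanimous_wrong = not disagree and label not in predictions
        if disagree or unanimous_wrong:
            selected.append((x, label))
    return selected
```

For three classifiers, "not all predictions equal" is the same condition as $(C_1(x) \neq C_2(x)) \lor (C_2(x) \neq C_3(x))$, which is why a set of predictions suffices here.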

Figure 1: Sample training sets generated by the combiner and arbiter strategies

The arbiters are trained by some learning algorithm
on the particular distinguished distributions
of training data and are used in generating
predictions. During the classification of
an instance, $y$, the learned arbiter, $A$, and the
corresponding arbitration rule produce a final
prediction based on the three predictions, $C_1(y)$,
$C_2(y)$, and $C_3(y)$, from the three base classifiers
and the arbiter’s own prediction, $A(y)$. The following
arbitration rule applies to both schemes
for the selection rule described above.

1&2. Return the simple vote of the base and arbiter’s
predictions, breaking ties in favor
of the arbiter’s prediction; i.e., if there are
no ties, return $\mathit{vote}(C_1(y), C_2(y), C_3(y), A(y))$,
otherwise return $A(y)$.
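
The arbitration rule can be sketched in a few lines. This is an illustrative sketch only; the function name `arbitrate` is hypothetical and predictions are assumed to be hashable class labels.

```python
from collections import Counter


def arbitrate(base_predictions, arbiter_prediction):
    """Simple vote over base and arbiter predictions; ties go to the arbiter."""
    votes = Counter(base_predictions)
    votes[arbiter_prediction] += 1
    top_count = max(votes.values())
    winners = [cls for cls, n in votes.items() if n == top_count]
    # A unique plurality winner is returned; otherwise the arbiter decides.
    return winners[0] if len(winners) == 1 else arbiter_prediction
```

With three base classifiers, the arbiter's prediction is decisive whenever the base predictions split two to one against it or are all different.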

3.3 Hybrid strategy

We  integrate  the  combiner  and  arbiter  strategies 
in  the  hybrid  strategy.  Given  the  predictions  of 
the  base  classifiers  on  the  original  training  set, 
a  selection  rule  picks  examples  from  the  train¬ 
ing  set  as  in  the  arbiter  strategy.  However,  the 
training set for the meta-learner is generated by
a  composition  rule  applied  to  the  distribution 
of  training  data  (a  subset  of  E)  as  defined  in 
the  combiner  strategy.  Thus,  the  hybrid  strat¬ 
egy  attempts  to  improve  the  arbiter  strategy  by 


correcting  the  predictions  of  the  “confused”  ex¬ 
amples.  It  does  so  by  using  the  combiner  strat¬ 
egy  to  coalesce  the  predicted  classifications  of 
instances  in  disagreement  by  the  base  classi¬ 
fiers,  instead  of  purely  arbitrating  among  them. 
A learning algorithm then generates a meta-classifier
from this training set. When a test
instance  is  classified,  the  base  classifiers  first 
generate  their  predictions.  These  predictions 
are  then  composed  to  form  a  meta-level  instance 
for  the  learned  meta-classifier  using  the  same 
composition  rule.  The  meta-classifier  then  pro¬ 
duces  the  final  prediction. 

We  experimented  with  two  combinations  of 
composition  and  selection  rules,  though  any 
combination  of  the  rules  is  possible: 

1. Select examples that have different predictions
from the base classifiers; the predictions,
together with the correct classes
and attribute vectors, form the training
set for the meta-learner. This integrates
the meta-different and meta-class-attribute
schemes. (Henceforth, we refer to this
scheme as meta-different-class-attribute.)

2. Select examples that have different or incorrect
predictions from the base classifiers;
the predictions, together with the
correct classes and attribute vectors, form
the training set for the meta-learner. This
integrates the meta-different-incorrect and
meta-class-attribute schemes. (Henceforth,
denoted as meta-different-incorrect-class-attribute.)
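
The two hybrid schemes can be sketched as a selection rule followed by the meta-class-attribute composition rule. This is a minimal sketch, not the authors' code: the function name and the `include_incorrect` flag are hypothetical, classifiers are assumed to be callables, and attribute vectors are assumed to be tuples.

```python
def hybrid_training_set(E, classifiers, include_incorrect=False):
    """Select confusing examples, then compose meta-level instances.

    include_incorrect=False sketches meta-different-class-attribute;
    include_incorrect=True sketches meta-different-incorrect-class-attribute.
    """
    T = []
    for x, label in E:
        predictions = [c(x) for c in classifiers]
        disagree = len(set(predictions)) > 1
        unanimous_wrong = not disagree and predictions[0] != label
        if disagree or (include_incorrect and unanimous_wrong):
            # Composition rule: correct class, predictions, attribute vector.
            T.append((label, tuple(predictions) + tuple(x)))
    return T
```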

We discussed three general meta-learning
strategies  (combiner,  arbiter,  and  hybrid)  for 
multistrategy  hypothesis  boosting.  The  com¬ 
biner  strategy  aims  at  coalescing  predictions 
from  the  constituent  classifiers,  whereas  the  ar¬ 
biter  strategy  arbitrates  among  them.  In  addi¬ 
tion,  the  training  set  for  the  combiner  strategy 
includes  examples  derived  from  the  entire  orig¬ 
inal  training  set,  whereas  the  one  for  the  arbiter 
includes  only  examples  chosen  by  a  selection 
rule  from  the  original  set.  That  is,  the  train¬ 
ing  set  for  the  arbiter  strategy  is  usually  smaller 
than  the  one  for  the  combiner  strategy  and  hence 
contains  less  information.  The  hybrid  strategy 
is intended to remedy this deficiency in the
arbiter  strategy  by  coalescing  the  predictions 
from  the  selected  examples.  We  postulate  that 
the  combiner  strategy  would  still  be  the  most 
effective  one  due  to  the  larger  amount  of  infor¬ 
mation  available  and  the  coalescing  process. 

Among the combiner schemes, the meta-class-attribute
scheme provides more information
(the addition of attribute vectors) than the meta-class
scheme in the combiner training set.
The meta-class-binary scheme provides more
precise information for the meta-learner because
more specialized base classifiers are used.
Among  the  arbiter  schemes,  the  meta-different- 
incorrect  scheme  includes  more  examples  in 
the  arbiter  training  set  than  the  meta-different 
scheme.  Similar  observations  can  be  made  for 
corresponding  schemes  in  the  hybrid  strategy. 

The choice of the meta-learner to perform the
above strategies is another issue. Due to the
relatively low regularity in the training data for
meta-learners, we postulate that a probabilistic
learner would be more effective than a categorical
one.

3.4  Summary  of  empirical  results 

Experiments on the aforementioned strategies
were run with different combinations of three
base-learners and one meta-learner. In the experiments
we employed four different learning
algorithms: BAYES (described in (Clark &
Niblett, 1987)), ID3 (Quinlan, 1986), CART
(Breiman et al., 1984), and WPEBLS (the
weighted version of PEBLS (Cost & Salzberg,
1993)), and two molecular biology data sets:
protein secondary structures (SS) (Qian & Sejnowski,
1988) and DNA splice junctions (SJ)
(Towell et al., 1990). Details of the experiments
and quantitative results obtained are reported in
(Chan & Stolfo, 1993a). Space limitations prevent
us from displaying them here. We summarize
our results as follows.

There are two ways to analyze the results. First,
we consider whether the employment of a meta-learner
improves accuracy with respect to the
underlying three base classifiers. For both sets
of data, we discovered improvements were always
achieved when BAYES was used as the
meta-learner and the other three learning algorithms
(ID3, CART, and WPEBLS) served as
the base-learners (improved from 55.4% up to
60.7% in SS and from 94.8% up to 96.6% in
SJ), regardless of the meta-learning strategies
employed. Next, when we considered combinations
of a particular meta-learner and strategy,
regardless of the base learners, the results
were mixed. For the SJ data set, the
same or better accuracy was consistently attained
when BAYES was the meta-learner in the
meta-class-attribute strategy (improved from
94.8% up to 97.2%) and ID3 in the meta-class
and meta-class-attribute strategies (improved
from 94.8% up to 96.9%), regardless of the
base learners. For the SS data set, none of the
meta-learner/strategy combinations maintained
a consistent improvement.

Second, we consider whether the use of meta-learning
achieves higher accuracy than the
most accurate single-strategy learner (which
was BAYES in this case). For the SJ data set,
improvement was consistently achieved when
BAYES was the meta-learner in the meta-class-attribute
strategy (improved from 96.4% up to
97.6%), regardless of the base learners. In fact,
when the base learners were BAYES, ID3, and
CART, the overall accuracy was the highest obtained.
For the SS data set, almost none of the results
outperformed BAYES as a single-strategy
learner.

The  two  data  sets  chosen  represent  two  differ¬ 
ent  kinds  of  data  sets:  SS  is  difficult  to  learn 
(50+%  accuracy)  and  SJ  is  easy  to  learn  (90+% 
accuracy).  Our  experiments  indicate  that  some 
of our meta-learning strategies improve accuracy
in the SJ data set. However, in the SS
data set, meta-learning did not improve accuracy.
This can be attributed to the quality of
predictions from the base classifiers for the two
data sets. The high percentage of having one
or none correct out of three predictions in the
SS data set might greatly hinder the ability of
meta-learning to work. One possible solution
is to increase the number of base classifiers to
lower the percentage of having one or no correct
predictions.

In general, the combiner strategies performed
more effectively than the arbiter and hybrid
strategies in the test cases studied. To our surprise,
the hybrid schemes did not improve on the
arbiter strategies. This indicates that coalescing
the predictions is more beneficial than arbitrating
among them. Among the combiner strategies,
meta-class-attribute was particularly effective.
This suggests that predictions alone
are not sufficient for meta-learning. Furthermore,
BAYES was usually the more successful
meta-learner, which coincides with our earlier
intuition that probabilistic meta-learners might
be more effective than others.


4  Multistrategy  Parallel  Learning 

The dual objectives of the multistrategy parallel
learning (MSPL) framework are to improve
accuracy using multiple algorithms and to speed
up the learning process by parallel processing in
a divide-and-conquer fashion. Since the MSPL
framework  is  an  integration  of  the  multistrategy 
hypothesis  boosting  (presented  in  the  previous 
section)  and  the  parallel  learning  (PL)  frame¬ 
works,  we  briefly  review  the  PL  framework  in¬ 
troduced  in  (Chan  &  Stolfo,  1993c)  before  we 
discuss  the  MSPL  framework. 

4.1  Parallel  learning 

The  objective  here  is  to  speed  up  the  learning 
process  by  divide-and-conquer.  The  data  set  is 
partitioned  into  subsets  and  the  same  learning 
algorithm  is  applied  on  each  of  these  subsets. 
Several  issues  arise  here. 

First,  how  many  subsets  should  be  generated? 
This  largely  depends  on  the  number  of  proces¬ 
sors  available  and  the  size  of  the  training  set. 
The  number  of  processors  puts  an  upper  bound 
on  the  number  of  subsets.  Another  considera¬ 
tion  is  the  desired  accuracy  we  wish  to  achieve; 
there  may  be  a  tradeoff  between  the  number  of 
subsets  and  the  final  accuracy.  Moreover,  the 
size  of  each  subset  cannot  be  too  small  because 
sufficient  data  must  be  available  for  each  learn¬ 
ing  process  to  produce  an  effective  classifier. 

Second, what is the distribution of training examples
in the subsets? The subsets can be disjoint
or overlapping. The class distribution can be
random,  or  follow  some  deterministic  scheme. 
We  experimented  with  disjoint  equal-size  sub¬ 
sets  with  random  distributions  of  classes. 

Third, how do we apply meta-learning to coalesce
partial results generated by the learning
processes? This is the more important question.
Our approach is meta-learning arbiters in
a binary-tree fashion.

Based  upon  a  number  of  candidate  predictions, 
an  arbiter,  together  with  an  arbitration  rule, 
decides  a  final  outcome  (similar  to  the  arbiter 
strategy  described  in  Section  3.2).  An  arbiter 
is  learned  from  the  output  of  a  pair  of  learning 
processes  and  recursively,  an  arbiter  is  learned 
from  the  output  of  two  arbiters.  A  binary  tree 
of  arbiters  (called  an  arbiter  tree)  is  generated 
with  the  initially  learned  classifiers  at  the  leaves. 
For $s$ subsets and $s$ classifiers, there are $\log_2(s)$
levels  in  the  generated  arbiter  tree.  The  manner 
in  which  an  arbiter  tree  is  computed  and  used  is 
the  subject  of  the  following  sections. 

4.2  Classifying  using  an  arbiter  tree 

When  an  instance  is  classified  by  the  arbiter 
tree,  predictions  flow  from  the  leaves  to  the 
root.  First,  each  of  the  leaf  classifiers  produces 
an  initial  prediction;  i.e.,  a  classification  of  the 
test  instance.  From  a  pair  of  predictions  and  the 
parent  arbiter’s  prediction,  a  combined  predic¬ 
tion  is  produced  by  an  arbitration  rule.  This 
process  is  applied  at  each  level  until  a  final 
prediction  is  produced  at  the  root  of  the  tree. 
Detailed  schemes  for  the  arbitration  rule  are  re¬ 
ported  in  (Chan  &  Stolfo,  1993c)  and  similar 
ones  can  be  found  in  Section  3.2.  Since  at  each 
level,  the  leaf  classifiers  and  arbiters  are  inde¬ 
pendent,  predictions  are  generated  in  parallel. 
Further  issues  and  strategies  for  efficiently  gen¬ 
erating  predictions  by  arbiter  trees  are  beyond 
the  scope  of  this  paper.  Next,  we  describe  how 
an  arbiter  tree  is  learned. 
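
The bottom-up flow of predictions through an arbiter tree can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Leaf` and `Node` types and the `predict` function are hypothetical names, and the arbitration rule is assumed to be the simple vote with ties broken by the arbiter, by analogy with Section 3.2 (the exact schemes are those reported in (Chan & Stolfo, 1993c)).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    classifier: Callable  # leaf classifier: instance -> class label


@dataclass
class Node:
    arbiter: Callable  # arbiter: instance -> class label
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def predict(tree, x):
    """Propagate predictions from the leaves to the root."""
    if isinstance(tree, Leaf):
        return tree.classifier(x)
    left = predict(tree.left, x)
    right = predict(tree.right, x)
    own = tree.arbiter(x)
    # Vote among the two children and the arbiter; with no majority,
    # the arbiter's own prediction is returned.
    winner, count = Counter([left, right, own]).most_common(1)[0]
    return winner if count >= 2 else own
```

The recursion is written serially here; the two recursive calls at each node are independent, which is what allows the predictions at one level to be generated in parallel.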

4.3 Meta-learning an arbiter tree

We experimented with several schemes to meta-learn
a binary tree of arbiters. In all these
schemes  the  leaf  classifiers  are  first  learned 
from  randomly  chosen  disjoint  data  subsets  and 


the  classifiers  are  grouped  in  pairs.  (The  strat¬ 
egy  for  pairing  classifiers  is  the  subject  of  fu¬ 
ture  study  and  is  discussed  later.)  For  each 
pair  of  classifiers,  the  union  of  the  data  subsets 
on  which  the  classifiers  are  trained  is  gener¬ 
ated.  This  union  set  is  then  classified  by  the 
two  classifiers.  A  selection  rule  compares  the 
predictions  from  the  two  classifiers  and  selects 
instances  from  the  union  set,  which  form  the 
training  set  for  the  arbiter  of  the  pair  of  clas¬ 
sifiers.  Thus,  the  rule  acts  as  a  data  filter  to 
produce  a  training  set  with  a  particular  distribu¬ 
tion.  Detailed  strategies  for  the  selection  rule 
are  reported  in  (Chan  &  Stolfo,  1993c)  and  sim¬ 
ilar  schemes  can  be  found  in  Section  3.2.  The 
arbiter is learned from this set with the same
learning algorithm used for the leaf classifiers. The process of forming
the  union  of  data  subsets,  classifying  it  using 
a  pair  of  arbiter  trees,  comparing  the  predic¬ 
tions,  forming  a  training  set,  and  training  the 
arbiter  is  recursively  performed  until  the  root 
arbiter  is  formed. 

For example, suppose there are initially four
training data subsets ($T_1$ through $T_4$). First, four classifiers
($C_1$ through $C_4$) are generated in parallel from
$T_1$ through $T_4$. The union of subsets $T_1$ and $T_2$, $U_{12}$, is
then classified by $C_1$ and $C_2$, which generates
two sets of predictions ($P_1$ and $P_2$). Based on
predictions $P_1$ and $P_2$, and the subset $U_{12}$, a selection
rule generates a training set ($T_{12}$) for the
arbiter. The arbiter ($A_{12}$) is then trained from the
set $T_{12}$ using the same learning algorithm used
to learn the initial classifiers. Similarly, arbiter
$A_{34}$ is generated in the same fashion starting
from $T_3$ and $T_4$, in parallel with $A_{12}$, and hence
all the first-level arbiters are produced. Then
$U_{14}$ is formed by the union of subsets $T_1$ through
$T_4$ and is classified by the arbiter trees rooted
with $A_{12}$ and $A_{34}$. Similarly, $T_{14}$ and $A_{14}$ (root
arbiter) are generated and the arbiter tree is completed
(see Figure 2).
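
The construction in this example can be sketched as a level-by-level loop. This is a serial sketch under several assumptions that are not taken from the paper: the names `build_arbiter_tree` and `majority_learner` are hypothetical, the number of subsets is assumed to be a power of two, adjacent subsets are paired, the selection rule keeps only instances on which the two sub-trees disagree, and the toy learner simply predicts the majority class.

```python
from collections import Counter


def majority_learner(training_set):
    """Toy learner for illustration: always predicts the majority class."""
    if not training_set:
        return lambda x: None
    label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: label


def build_arbiter_tree(subsets, learn):
    """Return a predictor for the arbiter tree built over the data subsets.

    Each entry in `level` is (predictor of a sub-tree, data it covers).
    """
    level = [(learn(T), list(T)) for T in subsets]  # leaf classifiers
    while len(level) > 1:
        next_level = []
        for (p1, d1), (p2, d2) in zip(level[0::2], level[1::2]):
            union = d1 + d2
            # Selection rule (assumed): keep instances where the pair disagree.
            selected = [(x, y) for x, y in union if p1(x) != p2(x)]
            arbiter = learn(selected)

            def subtree(x, p1=p1, p2=p2, arbiter=arbiter):
                a, b = p1(x), p2(x)
                # Agreement wins the vote outright; otherwise the arbiter decides.
                return a if a == b else arbiter(x)

            next_level.append((subtree, union))
        level = next_level
    return level[0][0]
```

In a parallel setting the iterations of the inner `for` loop are independent of one another, so each level of arbiters could be trained concurrently.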

Figure 2: Sample arbiter tree


4.3.1 Summary of empirical results

Experiments on the aforementioned strategies
were run on the four different learning algorithms
and two data sets described in Section 3.4.
Details of the experiments and preliminary
results obtained from a serial implementation
are reported in (Chan & Stolfo, 1993c).
Here we summarize those results.

In  one  set  of  experiments,  we  restricted  the 
training  set  size  for  an  arbiter  to  be  no  larger 
than  the  training  set  size  for  a  leaf  classifier. 
Hence,  the  amount  of  computation  in  training 
an  arbiter  is  bounded  by  the  time  to  train  a  leaf 
classifier.  In  a  parallel  computation  model  each 
level  of  arbiters  can  be  learned  as  efficiently  as 
the  leaf  classifiers,  and  hence  significant  speed 
up can be predicted. Our arbiter schemes maintained
the low accuracy in the SS data set and
degraded (up to 10% with 32 subsets) the high
accuracy (90+%) in the SJ data set. The accuracy
drop in the SJ data can be attributed to the
absence of a particular class in half of the arbiter
tree, resulting from a random class distribution
in the subsets. Another reason might be the
small size of each subset, which has about 80
examples.

When the restriction on the size of the training
set for an arbiter is lifted, the same level of accuracy
can be achieved. For the SJ data set,
empirical results show that the single largest
arbiter training set is about 30% of the entire
training set. Moreover, up to a six-fold
speed up with eight subsets/processors was still
achieved based on a theoretical calculation using
the $O(n^2)$ model of WPEBLS. (Not much
improvement can be obtained in linear algorithms
like BAYES.) That is, less time and
memory than a serial version is needed to reach
the same results. However, the amount of speed
up leveled off after eight subsets because the
largest arbiter training set appeared at the root,
which formed a bottleneck. Since the pairing
of classifiers/arbiters affects the arbiter training
set sizes, we are currently investigating pairing
strategies to reduce the size of the largest set.
One scheme is to pair classifiers and arbiters
that agree most often with each other. Another
scheme is to pair those that disagree the most.
At first glance the first scheme would seem to
be more attractive. However, since disagreements
are present if they do not get resolved
at the bottom of the tree, they will all surface
near the root of the tree, which is also where
the choice of pairings is limited or nonexistent
(there are only two arbiters one level below the
root). Hence, it might be more beneficial to
resolve conflicts near the leaves, leaving fewer
disagreements near the root. Empirical results
indeed show that pairing the classifiers that produced
the larger sets (more disagreements) at
the leaf level reduced the size of the largest set
in the tree.

4.4  Approaches  to  MSPL 

Recall,  we  seek  to  both  improve  accuracy  and 
speed  up  the  computation.  There  are  three  gen¬ 
eral  approaches  in  achieving  these  goals  that 
differ  in  the  amount  of  data  used  and  the  man¬ 
ner  in  which  the  learning  algorithms  are  applied. 
For  pedagogical  reasons,  the  following  discus¬ 
sion  will  use  the  combiner  approach  as  the  de¬ 
fault  scheme  for  combining  results  at  the  meta¬ 
level.  Here  we  present  the  approaches  currently 
under  development  and  some  empirical  results 
obtained  for  the  second  approach. 

4.4.1 Coarse-grain diversity with the
entire training set

In this approach each learning algorithm is applied
to the entire training data set. The learners
are run concurrently and each follows the
PL framework. As a result, each learner produces
an arbiter tree. The predictions of the
arbiter trees on the training set become part of
the training examples for the combiner in the
MSHB framework. Since the training of the
combiner has to be performed after the arbiter
trees are formed, all the processors used to train
the arbiter trees will be available for training
the combiner. Again, the combiner is trained
using the PL framework, which generates another
arbiter tree. During the classification of
an instance, predictions are generated from the
learned arbiter trees and are coalesced by the
learned combiner (an arbiter tree itself). For
further reference, this approach is denoted as
coarse-all. Figure 3(a) schematically depicts
the arbiter trees formed. Each triangle represents
a learned arbiter tree. $L_1$, $L_2$, $L_3$ are the
different learners (three of them in this case)
and $L_c$ is the learner for the combiner. $E$ is
the original set of training examples. Since this
approach is similar to the MSHB framework,
except for the use of arbiter trees, accuracy results
similar to those presented in Section 3.4 can be
expected here.

Since  different  learning  algorithms  have  differ¬ 
ent  complexity,  different  learners  will  finish  the 
same  task  at  different  times.  That  is,  it  is  im¬ 
portant  to  determine  how  many  processors  are 
allocated  to  each  learner  to  balance  the  com¬ 
putational  load  and  reduce  any  large  variance 
in  completion  times.  To  define  and  implement 
a  scheme  to  allocate  processors  and  data,  rela¬ 
tive  speeds  of  the  learners  have  to  be  measured. 
One  approach  is  to  determine  the  speed  of  a 
learner  empirically  relative  to  the  data  set  size 
and  derive  a  function  to  approximate  the  rela¬ 
tionship  between  speed  and  data  set  size.  Rel¬ 


ative  speeds  of  the  learners  are  then  estimated 
by  these  functions.  Given  the  relative  speeds, 
we  then  allocate  processors  to  each  learner  to 
achieve  load  balancing. 
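
One simple way to turn estimated relative speeds into a processor allocation is a greedy assignment, sketched below. This is only an illustration and is not taken from the paper: the function name is hypothetical, the per-learner training-time estimates are assumed to be given (for example, from the fitted speed functions mentioned above), and training time is assumed to divide evenly across the processors assigned to a learner.

```python
def allocate_processors(estimated_times, total_processors):
    """Greedily assign processors so that per-processor loads even out.

    estimated_times[k] is the estimated single-processor training time of
    learner k; every learner receives at least one processor.
    """
    n = len(estimated_times)
    if total_processors < n:
        raise ValueError("need at least one processor per learner")
    allocation = [1] * n
    for _ in range(total_processors - n):
        # Give the next processor to the learner with the largest current load.
        busiest = max(range(n), key=lambda k: estimated_times[k] / allocation[k])
        allocation[busiest] += 1
    return allocation
```

For instance, with estimated times of 10, 30, and 60 units and ten processors, the slowest learner ends up with the most processors and all three learners have the same estimated completion time.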

4.4.2 Coarse-grain diversity with data
subsets 

This approach is similar to the previous one except
that each learner is applied concurrently to
a different data subset, rather than the entire data
set. That is, the data set is divided into $l$ subsets,
where $l$ is the number of learning algorithms
available, and a different learning algorithm is
applied to each subset. The PL framework is
then applied to each algorithm-subset pair. As
a result, $l$ arbiter trees will be formed. The predictions
of the arbiter trees on the training set
become part of the training examples for the
combiner in the MSHB framework. Similar to
the previous approach, the combiner is generated
as another arbiter tree. When an instance is
classified, the learned arbiter trees first produce
their predictions, which are then coalesced by
the learned combiner. This approach is denoted
as coarse-subset. Figure 3(b) schematically depicts
the arbiter trees formed. $L_1$, $L_2$, $L_3$ are the
different learners and $L_c$ is the learner for the
combiner. $E_1$, $E_2$, $E_3$ are subsets of the original
set of training examples.

As in the previous approach, load balancing is essential in
minimizing the overall training time. However, in this approach we
have to determine how to allocate the data subsets as well as the
processors. One approach is to distribute the data evenly among
the learners and allocate processors according to their relative
speeds. Another approach is to give each learner the same number
of processors and distribute the data accordingly. That is, we
have to decide whether we allocate a uniform number of processors
or a uniform amount of data to each learner. Since the amount of
data affects the quality of the learned concepts, it is more
desirable to distribute the data evenly so that the learners are
not biased at this stage. That is, slower learners should not be
penalized with less information and thus they should be allocated
more processors.

Figure 3: Learned arbiter trees from the MSPL approaches with three different learners

This raises the question of whether data should be distributed at
all; that is, should each learner have all the data, as in the
previous approach? Obviously, if each learner has the entire set
of data, it would be slower than when it has only a subset of the
data. It is also clear that the more data each learner has, the
more accurate the generated concepts will be. That is, there is a
tradeoff between speed and quality. But in problems with very
large databases, we may have no choice but to distribute subsets
of the data. Another question is how the data should be
distributed among the subsets. The subsets can be disjoint or
overlapped according to some scheme. We prefer disjoint subsets
because they allow the maximum degree of parallelism. The classes
represented in the subsets can be distributed randomly, uniformly,
or according to some scheme. Since maintaining the same class
distribution in each subset as in the entire set does not create
the potential problem of missing classes in certain subsets, it is
our preferred distribution scheme.
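A minimal sketch of the preferred distribution scheme follows. It assumes that dealing the examples of each class round-robin into the subsets is an acceptable way to keep the subsets disjoint while approximately preserving the class distribution; the paper does not prescribe a particular procedure, and the function name is hypothetical.

```python
from collections import defaultdict


def stratified_disjoint_subsets(examples, labels, n_subsets):
    """Deal each class round-robin so every subset keeps the class mix."""
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    subsets = [[] for _ in range(n_subsets)]
    for label, members in by_class.items():
        for i, example in enumerate(members):
            # Each example lands in exactly one subset, so subsets are disjoint.
            subsets[i % n_subsets].append((example, label))
    return subsets


# Illustrative data: 12 examples, two equally frequent classes.
parts = stratified_disjoint_subsets(list(range(12)), ["a"] * 6 + ["b"] * 6, 3)
```

A class with fewer examples than there are subsets would still be missing from some subsets, so this sketch only avoids the missing-class problem when every class is large enough.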


Summary of empirical results. Experiments on the coarse-subset
approach with the strategies discussed in Section 3 were run on
the four different learning algorithms and two molecular biology
data sets described in Section 3.4. Details of the experiments and
preliminary quantitative results obtained from a serial
implementation without the use of arbiter trees are reported in
(Chan & Stolfo, 1993b). We summarize those results as follows.

There are three ways to analyze the results. First, we look at
whether the employment of a meta-learner improves accuracy with
respect to the underlying three base classifiers learned on a
subset. For the SJ data, improvements were almost always achieved
when the combinations of base learners are ID3-CART-WPEBLS
(improved from 94.1% up to 96.2%) and BAYES-CART-WPEBLS (from
95.7% up to 97.2%), regardless of the meta-learners and
strategies. For the SS data, when the combination of base learners
is ID3-CART-WPEBLS, more than half of the meta-learner/strategy
combinations achieved higher accuracy than any of the base
learners (from 53.9% up to 96.2%).

Second, we look at whether the use of meta-learning achieves
higher accuracy than the most accurate classifier learned from a
subset (BAYES in this case). For the SJ data, the
meta-class-attribute strategy with BAYES as the meta-learner
always attained higher accuracy (from 95.7% up to 97.2%),
regardless of the base learners and strategies. For the SS data,
none of the results outperformed BAYES as a single base learner.

Third, we look at whether the use of meta-learning achieves higher
accuracy than the most accurate classifier learned from the full
training set (BAYES in this case). For the SJ data, the
meta-class-attribute strategy with BAYES as the meta-learner
almost always attained higher accuracy (from 96.4% up to 97.2%),
regardless of the base learners and strategies. For the SS data,
none of the results outperformed BAYES.

As in the results from the MSHB framework (Section 3.4),
meta-class-attribute is the more effective scheme and BAYES is the
more successful meta-learner. This reinforces our conjecture that
coalescing results is more effective than arbitrating among them
and that predictions alone are not enough for meta-learning.
Compared to results obtained from the MSHB framework, smaller
improvements were observed here. This is mainly due to the smaller
amount of information presented to the base learners.

4.4.3 Fine-grain diversity

In this approach the data set is divided into p subsets, where p
is the number of processors available, and a different learning
algorithm is applied to each subset. The subsets are paired and an
arbiter tree is formed in a similar fashion as for the
single-strategy arbiter tree. That is, instead of using the same
algorithm for training an arbiter and its two children as in the
PL framework, different algorithms are used. This also results in
generating only one arbiter tree, which contrasts with generating
multiple arbiter trees in the previous two approaches. When an
instance is classified, the prediction of the learned arbiter tree
is the final prediction. This approach is denoted as fine-grain.
Figure 3(c) schematically depicts the arbiter tree formed. L1, L2,
L3 are the different learners. E is the original set of training
examples.

Since each subset is allocated to one processor and different
algorithms have different speeds, the size of the subsets needs to
be adjusted accordingly to achieve load balancing. That is, the
size of the p subsets is determined by the relative speeds of
different algorithms applied to the subsets. Another issue we
consider is how the learning algorithms are allocated to the p
subsets and the subsequent training sets for the arbiters. They
could be allocated uniformly, randomly, or according to some
scheme.
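As an illustration of sizing the p subsets by relative speed, the sketch below assumes that each learner's training time grows linearly with subset size, so that giving a learner a share of the data proportional to its speed equalizes the expected completion times. Both the linear-time assumption and the name `subset_sizes` are ours; a learner with superlinear cost would need the empirically fitted speed function instead.

```python
def subset_sizes(n_examples, speeds):
    """Size one subset per processor in proportion to learner speed.

    speeds[i] is the (assumed constant) number of examples per second
    handled by the learner assigned to processor i.
    """
    total = sum(speeds)
    sizes = [int(n_examples * s / total) for s in speeds]
    # Examples lost to rounding go to the fastest learners first.
    remainder = n_examples - sum(sizes)
    fastest_first = sorted(range(len(speeds)), key=lambda i: speeds[i],
                           reverse=True)
    for i in fastest_first[:remainder]:
        sizes[i] += 1
    return sizes


# Hypothetical speeds: the third learner is twice as fast as the others.
sizes = subset_sizes(1000, [1.0, 1.0, 2.0])
```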

4.4.4  Discussion 

In terms of learning speed, the coarse-grain approaches are slower
than the fine-grain approach because the combiner has to be
learned after all the base arbiter trees are learned, whereas only
one arbiter tree is learned in the fine-grain approach. In terms
of overall prediction accuracy, the coarse-all approach should be
more accurate than the coarse-subset approach because of the
larger amount of information available to each learner in the
first approach. (Recall that each arbiter tree in coarse-all is
trained on the entire training set, in contrast to a subset in
coarse-subset.) It is unclear at this point how the fine-grain
approach will perform compared to the other two; this is the
subject of further experimentation. Furthermore, the coarse-grain
approaches use a combiner to coalesce the arbiter trees, whereas
the fine-grain approach does not. However, more diversity is
present in the fine-grain approach.

In our experiments for the MSHB framework, one of the learning
algorithms (BAYES) consistently performs better as a meta-learner
than the other algorithms. (Recall that meta-learners learn from
the output of other learners and base learners learn from the
initial raw data.) Since multiple learning algorithms are
available in the three approaches, it might be beneficial to use
the single most effective learning algorithm as the meta-learner
throughout. That is, different learners are used at the leaves,
but the same learner is used for the rest of the tree.

Moreover, we presently concentrate on improving learning speed via
parallelism and prediction accuracy via diversity. However, the
arbiter tree concept for parallel learning can be extended to
improve accuracy as well. Recall that the arbiter tree has
multiple levels of arbiters and the training set for an arbiter is
derived from the arbiter subtrees. Currently, this training set
consists of information derived from the original training data
and/or the predictions from the arbiter subtrees. However, we can
incorporate features (combinations of attributes) used in the
arbiter subtrees into this set. In other words, at each level of
arbiters, features are constructed from lower-level ones and are
added to the training sets for the arbiters (i.e., performing
constructive induction between levels). This way, more effective
features would be in an arbiter's training set and hence a more
accurate arbiter might be learned. Unfortunately, feature
construction requires non-negligible computation overhead,
especially when a large number of records is concerned, since a
value has to be calculated for each new feature and record.
However, this might be compensated for by reduced search time for
some of the learning algorithms because of the presence of more
useful features.

5  Concluding  Remarks 

The frameworks presented demonstrate that meta-learning can be
used as a unified approach to facilitate multistrategy and
parallel learning. The various meta-learning strategies we have
explored are also independent of the learning algorithms or
parallel/distributed environment used. Preliminary empirical
results suggest that certain strategies and meta-learners are more
effective than others. Our results are preliminary and more
experiments are being performed to ensure that the results we have
achieved to date are indeed statistically significant, and to
study how meta-learning scales with much larger data sets. The
strategies discussed here are by no means final. We intend to
refine our current strategies and explore others as our
experiments progress.

Abstract

This paper describes a new approach to
conceptual clustering called clustering for
single numeric attribute prediction (CSNAP).
The  events  provided  to  the  system  have  one 
attribute  that  is  to  be  predicted  using 
descriptions  of  the  other  attributes.  The  CSNAP 
system  addresses  the  problem  of  constructing 
clusters  of  points  that  have  a  probabilistic  value. 
In  other  words,  the  same  state  description  does 
not  necessarily  have  the  same  value  for  one  of 
the  key  attributes.  This  paper  describes  how 
statistical  methods  are  combined  with  machine 
learning  techniques  in  order  to  build  useful 
classifications  of  points. 

Keywords:  machine  learning,  multistrategy 
learning,  conceptual  clustering,  prediction, 
statistics 

1.  Introduction 

People  have  an  incredible  skill  of  observing  a 
few  instances  of  a  situation  and  being  able  to 
learn  from  them.  What  is  learned  is  not  100% 
accurate,  but  people  still  use  the  knowledge  and 
develop  a  sense  of  its  reliability.  A  simple 
example  of  such  a  task  is  learning  when  the  lines 
are  long  at  the  bank.  After  just  a  few  visits, 
people  learn  that  going  to  the  bank  on  Friday 
afternoons and before a holiday is to be avoided
because of long lines. When the bank patron
visits  the  bank,  lines  are  never  exactly  the  same 
length,  and  there  have  been  a  few  times  on 
Friday  afternoons  when  the  lines  were  short. 
Yet,  the  general  concept  is  learned  and  these 
rules  about  the  bank  are  used  with  confidence. 
Problems  like  this  occur  everywhere:  predicting 
the  traffic  rate  at  a  certain  time  of  day, 
determining  what  time  of  month  the  factory  will 
be  busy,  etc. 

While  people  are  good  at  understanding  and 
learning  about  fairly  repetitive  events,  machine 
learning  systems  are  not.  The  goal  of  this 
research  is  to  produce  a  learning  system  that  can 
understand  such  problems.  In  the  process  of 
learning  to  make  accurate  predictions  there  are 
two  additional  objectives: 

1)  The  confidence  in  the  prediction  made 
for  such  problems  can  be  stated  along 
with  the  prediction. 

2)  The  rules  used  to  make  the  prediction 
are  understandable  to  the  humans  using 
the  system. 

Statistical measures are able to provide the first
capability. Symbolic learning systems are able
to provide the second. This work integrates the
strengths of each approach in order to provide
capabilities not found in other learning systems.

The  next  section  of  the  paper  describes  the 
motivation  for  developing  the  CSNAP  system 
with  special  focus  on  why  statistical  and 
symbolic  learning  methods  needed  to  be 
combined.  The  third  section  describes  the 
CSNAP  system  and  the  fourth  section  presents 
results  of  an  application  where  CSNAP  has  been 
used  to  predict  building  traffic  patterns.  The 
paper  concludes  with  some  final  remarks  on  the 
system. 

2.  Motivation 

The  CSNAP  system  was  developed  for 
situations  with  the  following  criteria: 

♦  A  numeric  (real  valued)  attribute  is  to  be 
predicted  (dependent  variable). 

♦  Two  events  with  identical  attribute  values 
can  have  different  dependent  variable 
values. 

♦  The  attributes  used  to  describe  an  event  need 
not  be  independent. 

♦  Accurate  predictions  for  unseen  events  are 
required. 

♦  Meaningful  descriptions  of  event  classes  are 
produced. 

♦  The  knowledge  produced  from  data  should 
be  (easily)  extendible  with  additional 
human  knowledge. 


Many  problems  are  defined  by  the  criteria 
identified  above.  These  problems  define  events 
that have a dependent variable that is influenced
by  the  world  in  non-deterministic  ways. 
Similar  event  descriptions  do  not  guarantee  that 
the  dependent  variable  value  will  be  identical. 
In  order  to  gain  knowledge  about  the  world 
there  must  exist  some  regularity  in  those  events. 
Statistics  can  capture  this  type  of  regularity  by 
describing  the  mean  and  variance  of  the 
dependent  variable  values  for  a  set  of  events. 
The  mean  value  provides  a  “best  guess”  for 
events  with  that  state  description  and  the 
variance  indicates  the  confidence  or  “goodness” 
of  the  guess.  The  goal  of  the  CSNAP  system  is 
to  cluster  events  to  minimize  the  variance  of  the 
events  within  a  class,  thus  improving  the 
confidence  of  the  predictions  made  using  the 
classification. 
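The role of the mean and variance can be made concrete with a small sketch. The traffic rates below are invented for illustration, and the grouping by weekday and weekend is only an assumed example of a useful classification; the point is that a class with lower variance gives a more trustworthy "best guess".

```python
from statistics import mean, pvariance


def summarize(values):
    """Return (best guess, spread) for the dependent variable of one class."""
    return mean(values), pvariance(values)


# Hypothetical traffic rates, for illustration only.
weekday = [9.0, 11.0, 10.0, 10.0]
weekend = [1.0, 3.0, 2.0, 2.0]

pooled_guess, pooled_var = summarize(weekday + weekend)
weekday_guess, weekday_var = summarize(weekday)
weekend_guess, weekend_var = summarize(weekend)
```

Treating all events as one class gives a guess of 6.0 with variance 16.5, while the two separate classes give guesses of 10.0 and 2.0, each with variance 0.5.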

Even  the  best  learning  system  is  limited  by  the 
data  it  is  given.  The  learning  system  can  capture 
the  knowledge  that  is  in  the  data,  but  not  all 
relevant  knowledge  is  necessarily  in  the  data. 
This  could  occur  if  the  problem  solver  using  the 
knowledge  produced  by  the  learning  system 
operates  in  a  dynamic  environment.  For 
example,  in  the  banking  scenario  presented 
earlier,  the  bank  might  send  out  a  notice  saying 
they  are  going  to  be  closed  next  week  to  update 
their  accounting  system.  That  information  is 
added  to  our  knowledge  about  bank  hours  so 
that  we  will  not  go  to  the  bank  next  week.  That 
type  of  information  also  needs  to  be  added  to  the 
knowledge  structures  produced  from  the 
learning  system  so  the  problem  solver  using  it 
can  account  for  the  new  knowledge.  This 
suggests the importance of being able to
integrate knowledge from different sources.
Figure 1 demonstrates this concept. Some
learning systems deal with this by having the
user generate new data with the new knowledge
(Figure 1B). It is simpler and more reliable if
the knowledge can be added directly to the
system's knowledge structures (Figure 1C).


Figure  1.  Using  New  Knowledge 

CSNAP is more like a conceptual clustering
system (Michalski, 1983) than a classification
system.  Just  because  two  events  have  similar 
rates  does  not  (necessarily)  mean  that  they 
belong  together.  As  Figure  2  illustrates,  there  is 
a  continuum  between  classification  systems  and 
clustering  systems.  Classification  systems 
assume  that  one  attribute  identifies  the  class  of 
an  event.  Pure  clustering  systems  assume  that  all 
attributes  are  (nearly)  equal  in  their  importance 
to  the  found  classes.  CSNAP  is  more  closely 
related  to  a  goal  directed  clustering  system 
(Stepp,  1986)  because  while  the  rate  attribute  is 
treated  specially,  it  does  not  determine  class 
membership  by  itself.  In  addition,  CSNAP 
develops  new  clusters  of  the  dependent  variable 
values  (new  classes)  in  an  attempt  to  perform 
prediction  accurately. 

Statistics  can  be  used  to  induce  a  model  from  a 
set  of  data.  For  example,  linear  regression 
provides  the  best  fit  line  to  a  set  of  points. 
Models  produced  with  statistical  techniques 
handle real values very well and often provide
a measure of confidence in the results.
Unfortunately, the results are not easy to
understand by the non-technician. It is also not
clear how to add new knowledge to the model
other than performing the steps outlined in
Figure 1B. Several machine learning systems
such as ID3 (Quinlan, 1986) and CART
(Breiman,  1984)  have  used  statistics  or  similar 
measures  to  induce  a  model.  However,  these 
systems  used  statistics  to  create  the 
classifications  but  did  not  adjust  those 
classifications  in  order  to  improve 
understandability  of  the  resulting  concept. 
Connectionist  systems,  though  able  to  produce 
good  classifications,  do  not  provide  the  learned 
knowledge  in  a  manner  that  is  easily 
comprehended  or  extended  by  humans. 
CSNAP  attempts  to  integrate  the  use  of  statistics 
with  the  desire  to  have  understandable  concepts. 
As  explained  in  the  next  section,  it  is  the 
struggle  to  provide  results  that  are  both  highly 
accurate  and  comprehensible  that  makes 
CSNAP  unique. 


[Figure 2 diagram: a continuum running from supervised learning to
clustering, with attributes treated more equally toward the
clustering end. Systems shown on the continuum: ID3, PLS, CSNAP,
COBWEB, AQ, AUTOCLASS, CLUSTER/2. Position along this continuum
indicates the degree to which a subset of the attributes is
selected as the primary basis for clustering. Other attributes are
used only to describe the resulting clusters/classes.]

Figure 2. Continuum between supervised and clustering learning systems.

3. CSNAP Implementation

The  basic  CSNAP  algorithm  is  presented  in 
Figure  3.  The  essence  of  the  approach  is  in  steps 
2  and  3,  in  which  a  kernel  of  events  is  selected 
and  an  initial  description  is  formed.  First  the 
system splits the events classified at a node into
two  clusters  of  events  in  order  to  minimize  the 
variance  of  the  dependent  attribute  in  the  two 
clusters.  The  procedure  to  build  the 
descriptions  then  uses  these  initial  clusters  as  a 
guide  to  constructing  the  classes.  Events  are 
moved  between  the  two  clusters  in  order  to 
construct  a  cohesive  class.  As  points  are  moved, 
the  variance  of  the  clusters  tends  to  increase,  but 
the  moves  are  selected  to  make  the  class 
descriptions  more  comprehensible.  There  is  a 
tradeoff  between  the  concise,  clear  descriptions 
and  low  variance.  The  user  can  adjust  weights 
that  are  used  to  balance  these  conflicting  goals. 


The system performs a beam search on the space
of possible class descriptions. Initially one class
is given the description NIL, which covers all
events. The description is extended by adding
new attribute/value pairs (constraints) to the
description in an attempt to cover one of the two
initial clusters (the classes formed by
minimizing the variance). Preference for
adding attributes to a description is determined
by the percentage difference of values between
the two original clusters of examples. The
percentage  difference  of  values  is  computed  by 
determining  the  percent  of  examples  of  the  class 
that  have  a  specific  attribute  value  and  getting 
the  difference  in  this  value  for  the  two  classes. 
Adding  a  previously  unused  attribute  makes  the 
description  more  specific  and  causes  examples 
covered  by  the  more  general  class  to  be  moved 
to  the  other  class  for  this  split  point.  Adding  a 
value  to  an  attribute  already  used  in  the 
description  makes  the  description  more  general. 


CSNAP (points, current-node)

1) Sort points on dependent variable
2) Find the cluster of points with smallest projected variance
3) Build a description of the cluster
   (Move points to/from cluster as needed to maintain a simple description)
4) Set new-points = points covered by description
5) Create new-node with new-points as child of current-node
6) Call CSNAP (new-points, new-node)
7) Set points = points - new-points
8) Call CSNAP (points, current-node)

Figure 3. CSNAP Algorithm
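Steps 1 and 2 of the algorithm can be sketched as a search over split points of the sorted dependent values. The sketch below is a simplification under stated assumptions: it scores a split by the plain sum of squared deviations in the two resulting clusters rather than by projected variance, and it is quadratic in the number of points. The function name is hypothetical.

```python
def best_variance_split(values):
    """Sort values and return the two-way split with least within-cluster spread."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two points to split")

    def squared_deviation(chunk):
        centre = sum(chunk) / len(chunk)
        return sum((v - centre) ** 2 for v in chunk)

    # Because the values are sorted, every candidate split is a cut position k.
    best_k = min(range(1, len(ordered)),
                 key=lambda k: squared_deviation(ordered[:k])
                 + squared_deviation(ordered[k:]))
    return ordered[:best_k], ordered[best_k:]


low, high = best_variance_split([10, 1, 11, 2, 12, 3])
```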

The descriptions are ranked for the search based
upon a weighted sum of the following:

♦  Variance 

1)  The  variance  of  the  events  covered  by 
the  description  should  be  as  small  as 
possible. 

2)  The  variance  of  the  events  not  covered 
by  the  description  should  be  as  small  as 
possible. 

♦ Generality

The  fraction  of  the  description  space 
covered  by  the  description  should  be  as 
large  as  possible. 

♦  Complexity 

The  number  of  attributes  and  values  used  in 
the  description  should  be  as  small  as 
possible. 

♦  Number  of  examples  covered 

1)  The  number  of  examples  covered  that 
were  in  one  of  the  original  clusters 
should  be  as  large  as  possible. 

2)  The  number  of  examples  covered  that 
were  in  the  other  original  cluster  should 
be  as  small  as  possible. 

Each  of  these  values  is  normalized  and  scaled  to 
be between 0 and 1. The weights for each value
can  be  adjusted  to  reflect  user  preferences  on  the 
system’s  results. 
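A weighted sum of this kind might look like the sketch below. The measure names, the convention of flipping "smaller is better" measures so that a larger score is always better, and the example weights are all assumptions made for illustration; the paper does not give the exact formula.

```python
# Hypothetical names for the measures that should be as small as possible.
LOWER_IS_BETTER = {"covered_variance", "uncovered_variance",
                   "complexity", "other_cluster_covered"}


def description_score(measures, weights):
    """Weighted sum of normalized measures; a larger score is better."""
    score = 0.0
    for name, weight in weights.items():
        value = measures[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
        if name in LOWER_IS_BETTER:
            value = 1.0 - value  # flip so that small raw values score high
        score += weight * value
    return score


# Illustrative values only.
score = description_score(
    {"covered_variance": 0.2, "generality": 0.5},
    {"covered_variance": 2.0, "generality": 1.0},
)
```

Raising the weight on complexity would favor short descriptions, while raising the variance weights would favor tighter clusters.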

The outlined approach is similar to ID3
(Quinlan, 1986) from the standpoint of splitting
off points, but similar to AQ (Michalski, 1975)
in the way that descriptions are constructed. It
is a clustering system because the class
memberships are determined dynamically as
the clusters are built.

It should be noted that step 2 of Figure 3
requires the calculation of “projected variance.”
Projected variance is closely related to
traditional variance but is based on a statistical
estimate of how large the variance of the cluster
might become as the number of samples
increases. This penalizes small clusters because
adding new samples can radically change their
variance. If traditional variance were used,
clusters of size one would always have the best
score. Effectively, the system calculates a
tradeoff of the variance to the number of points
included in the set. This allows the system to
return larger classes with slightly worse
variance in preference to smaller classes with
better variance. In addition, the system requires
that a minimal number of examples be included in a
class. This provides a parameter that allows the
user to control the noise tolerated.
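The paper does not give a formula for projected variance. As one possible stand-in, the sketch below uses a one-sided upper confidence bound on the variance under a normality assumption, with the chi-square quantile approximated by the Wilson-Hilferty formula. The function name, the confidence level and the choice of bound are all our assumptions, not the CSNAP definition.

```python
from statistics import NormalDist, variance


def projected_variance(values, confidence=0.95, min_size=2):
    """Upper confidence bound on variance; small clusters are penalized."""
    n = len(values)
    if n < max(min_size, 2):
        return float("inf")  # too few points to trust any estimate
    dof = n - 1
    z = NormalDist().inv_cdf(1.0 - confidence)  # negative, lower tail
    # Wilson-Hilferty approximation to the lower chi-square quantile.
    term = 1.0 - 2.0 / (9.0 * dof) + z * (2.0 / (9.0 * dof)) ** 0.5
    if term <= 0.0:
        return float("inf")
    chi2_low = dof * term ** 3
    return dof * variance(values) / chi2_low


small = projected_variance([1.0, 3.0])
large = projected_variance([1.0, 3.0] * 10)
```

With this bound, a two-point cluster scores far worse than a twenty-point cluster with a similar sample variance, and a single point is rejected outright, which mirrors the behavior described in the text.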

While discovering the classes, CSNAP builds a
classification tree. The tree is organized with
the root covering all the event descriptions and
its branches covering more specific subsets.
This ensures that all possible event descriptions
are covered by some node. Although binary
splits are used in finding the minimal variance
for the classifications, the trees produced are not
binary. The found class with the smallest
variance becomes a child of the other class,
which remains as the current node. The child is
then processed to determine if other splits can be
found and likewise, the current node is also
processed again to determine if other splits may
occur. Thus, a node can have any number of
children, as long as each split produces an
improvement in the variance of the data.

The classification trees produced by CSNAP
provide the capability to answer a query about
the dependent variable for a specific state
(time), determine the state with the predicted
optimal dependent attribute value, and
incorporate the user's knowledge by adding nodes.
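Answering a query and adding a node by hand can be sketched with a small tree structure. The `Node` class, its fields and the rule of descending into the first child that covers the event are assumptions made for illustration; the paper does not specify the data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    description: dict  # attribute -> set of allowed values; {} covers all events
    mean: float        # predicted value of the dependent variable
    variance: float    # spread, read as confidence in the prediction
    children: list = field(default_factory=list)

    def covers(self, event):
        return all(event.get(attr) in allowed
                   for attr, allowed in self.description.items())


def predict(node, event):
    """Descend to the most specific covering node and return its statistics."""
    for child in node.children:
        if child.covers(event):
            return predict(child, event)
    return node.mean, node.variance


# Illustrative tree with invented statistics.
root = Node({}, mean=5.0, variance=9.0)
root.children.append(Node({"day-type": {"weekend"}}, mean=1.0, variance=0.5))
# User knowledge added directly as a node (hypothetical holiday closure).
root.children.insert(0, Node({"month": {"dec"}, "day": {25}},
                             mean=0.0, variance=0.0))
```

Because the root has an empty description, every event is covered by some node, and a hand-added node takes effect without regenerating any training data.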

4. Learning Traffic Patterns

In this section, results of running CSNAP on a
model of building traffic are presented. This
model is not used to help form classes, only to
generate the training events. The dependent
variable is the traffic rate of a building, the
number of people entering and exiting the
building. The examples from the model are
described with eleven attributes (shown in
Figure 4) and the dependent variable rate. The
model takes into account a number of factors
when determining the rate including season and
time of day. When the model creates a data
point, it produces the typical traffic rate for the
specified time period. The actual number of
passengers observed is generated by sampling a
distribution determined by the traffic rate
supplied by the model. It is this
non-deterministic number of passengers that is
used by CSNAP for learning. This captures the
idea that in real buildings the rate will not be the
same every day at the same time; there exist
natural fluctuations in the data.
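The sampling step can be illustrated with a short sketch. The paper does not name the distribution, so a Poisson distribution with the model's rate as its mean is assumed here purely for illustration, sampled with Knuth's multiplication method. The function name and the example rate are hypothetical.

```python
import math
import random


def sample_count(rate, rng):
    """Draw a passenger count from a Poisson distribution (an assumption)."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    # Knuth's method: multiply uniforms until the product drops below e^-rate.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)  # fixed seed so the illustration is repeatable
observed = [sample_count(4.0, rng) for _ in range(5000)]
average = sum(observed) / len(observed)
```

The individual counts vary from draw to draw while their average stays close to the underlying rate, which is the kind of noisy training data the learner has to cope with.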

Attributes     Type       Values
rate           real       >0
seconds        circular   0-60 (integer)
minutes        circular   0-60 (integer)
hour           circular   midnight, 1am, 2am, ..., 11pm
day-or-night   nominal    day, night
day-of-week    circular   mon, tue, ..., sun
day-type       nominal    weekend, weekday
day            circular   1-31 (integer)
week           circular   first, second, third, fourth, fifth
month          circular   jan, feb, ..., dec
year           linear     >1980
season         circular   winter, spring, summer, autumn

Figure 4. Building Model Attributes

Two months of data, sampling the traffic every
five minutes, produced over 17,000 examples
for training. The test set was produced by
running  the  model  over  a  (simulated)  one  week 
period  (five  minute  sampling)  immediately 
following  the  (simulated)  two  month  training 
period.  Results  are  shown  in  Figure  5 
comparing  CSNAP,  ID3,  and  a  10-point 
moving  average  output  for  both  the  actual  data 
point,  and  the  true  model  value.  The  actual  data 
section  compares  the  systems’  predictions  to  the 
observed  traffic  rate.  The  model  section 
compares  the  predicted  rates  with  the 
underlying model of the simulator. In the figure,
AAE is the average absolute error of the value
from the learned model, Var is the variance of
those values, and Max is the maximum value
difference (maximum error in the test set). For
the ID3 results, the rates for the training
examples were rounded to the nearest integer,
and each integer rate defines a unique class. As
can be seen, CSNAP has the lowest error on every
measure (for example, an AAE of 1.04 against the
actual data, versus 1.46 for ID3 and 1.64 for the
moving average). Figure 6 graphically
shows the data for a single day with the true
model, the actual data, and the traffic predicted
by the CSNAP model. This figure illustrates
that CSNAP has learned a pattern very similar
to the original model, based only on noisy
samples drawn from that model (the actual data).
CSNAP did not require a manually developed
model of the environment, a description of
numbers or types of classes to be found, or any
theoretical assurances that the attributes are
independent. Even so, CSNAP was able to
create an accurate and fairly easy to
understand/modify empirical model of the
underlying process.
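The three measures reported in Figure 5 can be computed as in the sketch below. It assumes that "Var" is the population variance of the absolute errors, which is one reading of the description above, and the function name and sample numbers are hypothetical.

```python
from statistics import mean, pvariance


def error_summary(predicted, observed):
    """Return AAE, variance of the absolute errors, and the maximum error."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    return {"AAE": mean(errors), "Var": pvariance(errors), "Max": max(errors)}


# Invented values, for illustration only.
summary = error_summary([1.0, 2.0, 3.0], [1.0, 4.0, 2.0])
```

The same function can be applied once against the observed counts and once against the underlying model values to produce the two halves of such a table.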


Actual Data
             AAE     Var      Max
CSNAP        1.04    3.47     30.55
ID3          1.46    14.55    47.00
Mov Ave.     1.64    13.12    35.60

Model
             AAE     Var      Max
CSNAP        0.57    1.89     24.95
ID3          1.31    14.56    47.17
Mov Ave.     1.51    12.74    29.60

Figure 5. Results on Traffic Data

[Figure 6 plot, titled "Learned & Real, October 1, 1992".
Vertical axis: rate in passengers per minute. Horizontal axis:
time of day.]

Figure 6. Data Plots for One Day
Since ID3 was not developed to handle real-valued
dependent variables, the bin size of 1 (integers)
used in the previous results might not be the most
appropriate. Figure 7 shows a plot of the errors
for a variety of different bin sizes for ID3. The
original bin size was 1; that is, the actual rates
were rounded to integers and placed in a class. As
shown in the graph, this choice was near optimum
for the binning size. Even with much smaller bin
sizes (0.02), ID3 does not perform significantly
better, and the error rate is still much higher
than that of the CSNAP approach. This helps to
demonstrate that the problems CSNAP was designed to
address are not pure learning-from-examples tasks.
In addition, CSNAP helps to eliminate the step of
manually choosing the correct bin size: CSNAP
automatically identifies useful dependent value
clusters (bins).
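
The binning step described above can be sketched as follows. This is a minimal illustration only, not the preprocessing used in the experiments; the function name `bin_rate` and the sample rates are assumptions made for the example.

```python
def bin_rate(rate, bin_size=1.0):
    """Map a real-valued rate to the centre of its bin, usable as a class label."""
    # Round to the nearest multiple of bin_size; bin_size=1.0 rounds to integers.
    return round(rate / bin_size) * bin_size


# Hypothetical passenger rates (passengers per minute).
rates = [0.19, 2.37, 3.0, 5.46]
classes_coarse = [bin_rate(r, 1.0) for r in rates]   # integer-sized bins
classes_half = [bin_rate(r, 0.5) for r in rates]     # finer bins
```

With a class-based learner the bin size has to be chosen by hand before learning starts, which is the step the text says CSNAP avoids.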

Figure 7. Error for Different ID3 Category Sizes
(horizontal axis: Category Size, log scale)

Figure 8 shows how the average error of the
predictions decreases as CSNAP constructs the
classification tree. The solid line is the average
error over the one week testing period in
passengers per minute. The dashed line is the
average error for the tree plus one standard
deviation of the error (equivalent to an error bar
of one standard deviation for the average error
line). CSNAP was designed to construct classes
that have small variance, and this graph indicates
CSNAP is accomplishing that goal. By taking the
difference between the two lines, it can be seen
that the standard deviation of error decreases
from 4.2 to 1.3 passengers per minute. This graph
also indicates that the CSNAP system does not
overlearn from the training data. If it had, the
error would increase as the tree grew. The
completed tree has 470 nodes.


While this might not be as small as desired, the
ID3 tree for this problem had over 67,000
nodes [1]. This graph (Figure 8) was produced
by testing the CSNAP tree after each node was
added as it worked on the training data set. No
pruning was performed.


[1] A standard version of ID3 with chi-squared
pruning was used.
Part of a classification tree produced by CSNAP
is  shown  in  Figure  9.  The  tree  shown  was 
generated  with  4000  examples  from  the  above 
training  set.  Each  node  of  the  tree  is  indicated 
on  a  separate  line,  with  children  being  indented 
to  the  right  under  the  parent.  The  first  number 
in  each  line  is  a  line  number  to  be  used  for 
reference. The | symbol is used just to assist in
lining  up  the  nodes.  The  description  of  the 
nodes  is  placed  between  [  and  ].  In  this 
example  all  the  descriptions  are  a  single 
attribute  but  conjunctions  of  attributes  are 
allowed.  Internal  disjunction  of  the  attribute 
values  is  allowed  as  indicated  by  ...  (see 
line  18).  The  first  number  following  the 
description  is  the  average  value  of  the 
dependent  attribute  for  all  examples  currently 
classified  by  this  node.  The  second  number  is 
the  variance  for  these  examples  and  the  third 
number  is  the  number  of  examples  currently 
classified  by  this  node.  CSNAP  attempts  to 
push  events  as  far  down  in  the  tree  as  possible 
when  classifying  new  points.  For  example,  if  an 
event  comes  in  that  has  DAY-OF-WEEK  = 
Wednesday  and  HOUR  =  9am  and  a  prediction 
is  wanted,  the  system  would  place  the  event  in 
the  first  child  node  that  can  match  (cover)  the 
event  description.  In  this  example,  that  is  the 
node  on  line  4.  Since  Wednesdays  are  not 
covered  by  any  of  this  node’s  children,  the 
prediction  would  be  3.0  passengers  per  minute 
(on  average  with  a  variance  of  1.87  passengers 
per  minute  based  on  a  sample  of  24  events).  If 
instead  the  event  had  been  a  Friday,  the 
prediction  would  have  been  2.37  (line  5).  Had 
the event in question occurred on a Saturday, the
prediction would have been 0.19 passengers per
minute (line 2). It should be noted that it is
possible for a child of a node to have a larger
variance than its parent when the tree is finally
constructed. When the node was originally
separated from its parent, the node’s variance
must have been smaller, but since that time other
examples could have been split from the parent,
leaving it with a better overall variance. Only
the final examples left in a node are used to
calculate its prediction (average).
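
The classification procedure described above (push an event down the tree, taking the first child that covers it) can be sketched as follows. This is an illustrative reading of the text, not CSNAP's code: the `Node` class and `predict` function are assumptions, and the attribute names and numbers are transcribed by hand from lines 1, 2 and 4-8 of Figure 9.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node: a conjunction of attribute tests plus summary statistics."""
    tests: dict      # attribute -> set of allowed values (internal disjunction)
    mean: float      # average dependent value of the examples left in the node
    variance: float
    count: int
    children: list = field(default_factory=list)

    def covers(self, event):
        # Every tested attribute must take one of the allowed values.
        return all(event.get(attr) in allowed for attr, allowed in self.tests.items())


def predict(node, event):
    """Push the event as far down as possible, taking the first covering child."""
    for child in node.children:
        if child.covers(event):
            return predict(child, event)
    return node.mean, node.variance, node.count


nine_am = Node({"HOUR": {"9AM"}}, 3.00, 1.87, 24, [
    Node({"DAY-OF-WEEK": {"FRI"}}, 2.37, 1.01, 24),
    Node({"DAY-OF-WEEK": {"THU"}}, 2.67, 1.58, 24),
    Node({"DAY-OF-WEEK": {"MON"}}, 2.37, 1.62, 24),
    Node({"DAY-OF-WEEK": {"TUE"}}, 2.92, 1.84, 24),
])
root = Node({}, 17.20, 33.21, 20, [
    Node({"DAY-TYPE": {"WEEKEND"}}, 0.19, 0.17, 1440),
    nine_am,
])

wednesday = {"DAY-TYPE": "WEEKDAY", "DAY-OF-WEEK": "WED", "HOUR": "9AM"}
friday = {"DAY-TYPE": "WEEKDAY", "DAY-OF-WEEK": "FRI", "HOUR": "9AM"}
saturday = {"DAY-TYPE": "WEEKEND", "DAY-OF-WEEK": "SAT", "HOUR": "9AM"}
```

Because children are tried in order, the Saturday event is caught by the weekend node before the 9AM node is ever considered.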

It should also be observed from this tree
segment that adding knowledge to the tree is
trivial. For example, if it were known that the
traffic rate would increase at 2pm on Thursday,
this knowledge could be added as a node after the
root. As long as the information is added in the
tree before the “standard” node used for that
prediction, the new knowledge will override the
learned predictions.
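
The override effect can be illustrated with a flattened, one-level version of the tree. This is a simplified sketch under stated assumptions: only the root's children are modelled (deeper levels are ignored), the helper `first_match` is hypothetical, and the rate 9.0 for the hand-added node is invented for the example.

```python
# Children of the root in match order: (description, prediction in passengers/minute).
root_children = [
    ({"DAY-TYPE": "WEEKEND"}, 0.19),
    ({"HOUR": "2PM"}, 5.54),
]


def first_match(children, event, default):
    """Return the prediction of the first child whose description covers the event."""
    for description, prediction in children:
        if all(event.get(attr) == value for attr, value in description.items()):
            return prediction
    return default


event = {"DAY-TYPE": "WEEKDAY", "DAY-OF-WEEK": "THU", "HOUR": "2PM"}
learned = first_match(root_children, event, 17.20)

# Hand-added knowledge, placed before the learned 2PM node so that it matches first.
root_children.insert(0, ({"DAY-OF-WEEK": "THU", "HOUR": "2PM"}, 9.0))
overridden = first_match(root_children, event, 17.20)
```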

 1  []  17.20  33.21  20
 2  | [(DAY-TYPE = WEEKEND)]  0.19  0.17  1440
 3  | [(DAY-OR-NIGHT = NIGHT)]  1.49  0.32  1560
 4  | [(HOUR = 9AM)]  3.00  1.87  24
 5  | | [(DAY-OF-WEEK = FRI)]  2.37  1.01  24
 6  | | [(DAY-OF-WEEK = THU)]  2.67  1.58  24
 7  | | [(DAY-OF-WEEK = MON)]  2.37  1.62  24
 8  | | [(DAY-OF-WEEK = TUE)]  2.92  1.84  24
 9  | [(HOUR = 3PM)]  4.08  2.46  24
10  | | [(DAY-OF-WEEK = TUE)]  3.96  2.29  24
11  | | [(DAY-OF-WEEK = FRI)]  4.17  2.38  24
12  | | [(DAY-OF-WEEK = MON)]  4.42  3.01  24
13  | | [(DAY-OF-WEEK = WED)]  3.87  2.25  24
14  | [(HOUR = 10AM)]  0.00  0.00  0
15  | | [(DAY-OF-WEEK = THU)]  5.04  3.76  24
16  | | [(DAY-OF-WEEK = WED)]  4.62  3.78  24
17  | | [(DAY-OF-WEEK = MON)]  4.62  3.96  24
18  | | [(MINUTES = TWENTIETH-MIN...TENTH-MIN)]  5.12  3.86  24
19  | | | [(DAY-OF-WEEK = TUE)]  5.46  3.90  24
20  | [(HOUR = 2PM)]  5.54  4.72  24
21  | | [(DAY-OF-WEEK = TUE)]  4.87  2.97  24
22  | | [(DAY-OF-WEEK = FRI)]  4.83  4.47  24
23  | | [(DAY-OF-WEEK = THU)]  5.17  4.09  24
24  | | [(DAY-OF-WEEK = WED)]  5.12  4.82  24
25  | [(HOUR = 7AM)]  26.15  53.49  20
26  | | [(MINUTES = TENTH-MIN)]  2.50  0.91  20
27  | | [(MINUTES = ZEROTH-MIN)]  2.45  1.00  20
28  | | [(MINUTES = THIRTIETH-MIN)]  2.70  1.13  20
29  | | [(MINUTES = TWENTIETH-MIN)]  2.20  1.16  20
30  | | [(MINUTES = FORTIETH-MIN)]  13.55  28.98  20
31  | [(HOUR = 5PM)]  31.50  295.18  8
32  | | [(MINUTES = FIFTIETH-MIN)]  2.40  0.98  20
33  | | [(DAY-OF-WEEK = FRI)]  13.50  221.43  20
34  | | [(DAY-OF-WEEK = TUE)]  14.25  236.78  20
35  | | [(DAY-OF-WEEK = MON)]  14.30  251.19  20
36  | | [(MINUTES = THIRTIETH-MIN...FORTIETH-MIN)]  3.62  3.57  8
37  | | | [(MINUTES = THIRTIETH-MIN)]  4.25  4.22  8
38  | | [(MINUTES = TWENTIETH-MIN)]  5.62  6.79  8
39  | | [(DAY-OF-WEEK = WED)]  21.50  169.46  8
40  | [(HOUR = 8AM)]  39.30  138.77  20
41  | | [(MINUTES = TWENTIETH-MIN)]  2.65  0.71  20
42  | | [(MINUTES = FIFTIETH-MIN)]  2.15  0.80  20
43  | | [(MINUTES = THIRTIETH-MIN)]  2.15  0.90  20
44  | | [(MINUTES = FORTIETH-MIN)]  2.35  1.06  20
45  | | [(MINUTES = TENTH-MIN)]  13.75  19.31  20
46  | [(HOUR = 1PM)]  7.17  11.86  24
47  | | [(DAY-OF-WEEK = TUE)]  6.96  7.19  24
48  | | [(DAY-OF-WEEK = FRI)]  8.54  12.04  24
49  | | [(DAY-OF-WEEK = THU)]  7.54  7.43  24
50  | | [(DAY-OF-WEEK = WED)]  8.50  12.10  24

Figure 9. Segment of CSNAP tree

5. Conclusion

Adding the powerful features of statistics to
symbolic machine learning systems is important.
Using statistics, symbolic systems can add a
confidence factor to their results. This
confidence factor is important to those using the
knowledge produced by the learning system; it is
so important that some people will not use a
system without one. Statistics must be integrated
into the system so that the advantages of symbolic
learning, such as comprehensible descriptions,
are not sacrificed. Unless the knowledge produced
does not need to be understood by humans, it must
be in a form that they can comprehend and extend
as they deem necessary. CSNAP is an attempt to use
statistics in this way.

Although  it  has  been  shown  that  useful  results 
can  be  obtained  by  this  approach  there  are  a 
number  of  areas  for  future  work.  In  the  future  we 
expect  to  expand  our  comparison  of  CSNAP’s 
results  to  results  produced  by  other  algorithms 
such  as  CART,  AQ,  Cobweb,  and  Autoclass. 
This  task  is  difficult  because  direct  comparisons 
between  different  algorithms,  intended  for 
different  purposes,  are  often  misleading.  In 
addition,  it  is  often  difficult  to  find  standard 
implementations  of  these  algorithms  that 
include all the features/extensions needed for
good performance.

In  the  future  we  also  expect  to  expand  CSNAP 
to  identify  more  complex  relationships  in  the 
data.  Currently,  CSNAP  is  essentially  searching 
for  regions  of  the  space  where  the  dependent 
attribute  value  is  constant.  Other,  more  complex 
relationships,  such  as  a  linear  dependence  on  a 
single  attribute,  should  also  be  considered.  This 
not  only  could  provide  more  reasonable 
generalization,  but  more  understandable 
descriptions  as  well. 
Cooperation of Data-driven and Model-based
Induction Methods for Relational Learning

Abstract

Inductive learning in relational domains has
been shown to be intractable in general. Many
approaches to this task have been suggested
nevertheless; all in some way restrict the
hypothesis space searched. They can be roughly
divided into two groups: data-driven, where the
restriction is encoded into the algorithm, and
model-based, where the restrictions are made more
or less explicit with some form of declarative
bias. This paper describes Incy, an inductive
learner that seeks to combine aspects of both
approaches. Incy is initially data-driven, using
examples and background knowledge to put forth
and specialize hypotheses based on the
“connectivity” of the data at hand. It is
model-driven in that hypotheses are abstracted
into rule models, which are used both for control
decisions in the data-driven phase and for
model-guided induction.

Key Words: Inductive learning in relational
domains, cooperation of data-driven and
model-guided methods, implicit and declarative
bias.

1  Introduction 

Inductive learning in relational domains has
been shown to be intractable in general
[Kietz 93]. Many approaches to this task have
been suggested nevertheless; all in some way
restrict the hypothesis space searched. This is
done first by restricting the language in which
examples and background knowledge may be
expressed. Additionally, the language in which
the hypotheses are expressed is restricted.
Data-driven methods, such as FOIL [Quinlan 90]
and GOLEM [Muggleton/Feng 90], encode this
restriction into the algorithm. In model-based
systems such as RDT [Kietz/Wrobel 92], GRENDEL
[Cohen 92] and CLINT [Raedt 91], the restrictions
are made more or less explicit with some form of
declarative bias^1.

Despite the challenge, learning relational
concepts has great practical relevance, as
evidenced by the ESPRIT projects Machine Learning
Toolbox (MLT) and Inductive Logic Programming
(ILP) funded by the European Community^2. The
Mobal system for building knowledge-based
applications [Morik 91][Morik et al. 93] has been
used to develop several complex and
practice-oriented applications. Experience has
shown that both data-driven and model-based
approaches offer advantages in such a knowledge
engineering context [Sommer et al. ].
Specifically, FOIL showed great speed in passing
over the data discussed there, but the resulting
rules were not satisfactory in coverage of the
goal concept. More significantly, FOIL (and the
class of data-driven methods it represents) is a
black box whose behavior cannot be modified. Some
of its results may be statistically correct, but
do not fit the design goals for the knowledge
base, and their discovery inhibits the search for
more “sensible” alternative rules. RDT, which is
part of Mobal, on the other hand was slower, but
the results showed better coverage. More
significantly, the declarative bias RDT uses can
be tailored incrementally, so that later results
were both better in coverage than FOIL’s and
comparable in speed. The work reported in this
paper is an attempt at combining the speed of
heuristically guided data-driven methods with the
variability of model-based approaches.

^1 For a comparison of GOLEM, RDT and CLINT
see [Sutlic 92].

^2 P2145 and P6020 respectively. The work
reported on here is funded in part through the
latter.

2  INCY 

A rule model is a higher-order expression
similar to a rule, except that predicate
variables appear in the place of predicates
(refer to [Kietz/Wrobel 92] for a precise
definition). A rule is thus seen as an instance
of a corresponding rule model, where the model’s
predicate variables are instantiated with
specific predicate names. A model-based learner
such as RDT generates hypotheses by performing
these instantiations in a systematic manner with
predicates from the domain at hand^3.
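
The relation between a rule model and its instances can be sketched as follows. This is a naive enumeration for illustration only and makes no claim about how RDT orders or prunes its search; the representation of literals as `(name, args)` tuples, the function `instantiate` and the sample predicates are all assumptions.

```python
from itertools import product

# A rule model with predicate variables P, Q, R:  P(X, Y) & Q(Y, Z) -> R(X, Z)
premise_model = [("P", ("X", "Y")), ("Q", ("Y", "Z"))]
conclusion_model = ("R", ("X", "Z"))


def instantiate(premise, conclusion, predicates):
    """Yield every rule obtained by replacing predicate variables
    with predicate names of matching arity."""
    literals = premise + [conclusion]
    arity = {p: len(args) for p, args in literals}
    pred_vars = sorted(arity)
    choices = [[name for name, n in predicates.items() if n == arity[v]]
               for v in pred_vars]
    for combo in product(*choices):
        binding = dict(zip(pred_vars, combo))
        yield ([(binding[p], args) for p, args in premise],
               (binding[conclusion[0]], conclusion[1]))


# Hypothetical domain predicates with their arities.
predicates = {"parent": 2, "ancestor": 2, "male": 1}
rules = list(instantiate(premise_model, conclusion_model, predicates))
```

Each element of `rules` is one candidate hypothesis; a learner would still have to test it against the facts of the domain.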

Rule models can and do exist independently of a
specific domain (cf. “cliches” in
[Silverstein/Pazzani 91]). Naturally, such
generic models will not fit arbitrary data. Incy
is a learning algorithm designed to make a
somewhat frivolous pass through a given domain
and generate example hypotheses and rule models
based on the connectivity of the specific data at
hand. Even if none of the hypotheses are
acceptable, or Incy’s rules do not cover a
sufficiently large subset of the known examples
of the learning goal, a host of rule models are
produced that directly reflect the structure of
the data with respect to connectivity, i.e.,
which arguments of which predicates appear at
which places in which other predicates.

^3 The use of rule models as declarative bias
goes back to [Emde 87]; a very similar approach
is used in [Silverstein/Pazzani 91].


Figure 1: Fact graphs before and after
expanding object z

2.1  Connectivity 

The motivation for Incy’s inner workings comes
from a form of data inspection provided by Mobal:
the fact graph shows an incrementally extendable
excerpt of a given knowledge base as a graph.
Facts’ arguments (the objects in the domain) are
nodes, and the predicates in which they appear
are arcs. Given an example of the learning goal
in a domain, the graph initially shows all the
facts known about the example’s arguments, with
the arcs linking arguments to other objects. The
graph can be expanded to show all facts about
these other objects, causing yet other objects to
appear, etc. (Fig. 1). Eventually, the graph will
be fully expanded, displaying all facts about the
example’s arguments and the objects they are
linked to. Some conjunction of a subset of these
facts must be a valid rule about the concept.
Unfortunately, the number of conjunctions
theoretically possible is quite large; often all
facts in the domain are in some way ’connected’
to the objects appearing in the example, so that
any member of the powerset of the set of all
facts in the domain is a possible hypothesis. The
idea is to incrementally expand the set of
candidates, guided by the incrementally expanding
graph. If we have a learning goal foobar(x,y,z)
and the partial hypothesis

    foo(x) & foo(y) → foobar(x,y,z)

we will want to add something about z to the
premise before testing the validity of the
hypothesis. The candidate must be among the facts
about z:

    foobar(x,y,z)
    bar(v,z)
    bar(y,z) ...

The question is: which of these is the best new
conjunct? Quinlan’s FOIL [Quinlan 90] algorithm,
for instance, uses the “information gain”
heuristic to select a candidate, i.e. its value
is measured in terms of the role it plays in all
known examples of the goal. The INDUCE
algorithms, though they are applied to structured
objects rather than relational concepts, can be
seen as relying on a pre-classification of
candidates with additional heuristics
[Michalski 83]. Incy takes a more basic approach,
relying only on the connectivity information
represented by the graph. It tries to select a
candidate which is “most linked” to the other
objects in the rule, and which preferably does
not introduce a new object. In other words, it
tries not to expand the graph, but rather to find
a fact concerning objects already visible. In the
example, Incy would give preference to bar(y,z),
because bar(v,z) would introduce a new object v
(Fig. 1).
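
The preference just described can be sketched as a simple scoring function. This is one possible reading of “most linked” and not Incy's actual heuristic; the function `pick_conjunct`, the tuple encoding of facts and the tie-breaking order are assumptions.

```python
def pick_conjunct(candidates, visible, target):
    """Among facts about `target`, prefer those that introduce no new
    objects and that touch the most objects already visible."""
    def score(fact):
        _, args = fact
        new_objects = sum(1 for a in set(args) if a not in visible)
        links = sum(1 for a in set(args) if a in visible and a != target)
        return (new_objects, -links)   # fewer new objects first, then more links

    about_target = [f for f in candidates if target in f[1]]
    return min(about_target, key=score)


# Objects already visible in  foo(x) & foo(y) -> foobar(x,y,z)
visible = {"x", "y", "z"}
candidates = [("bar", ("v", "z")), ("bar", ("y", "z"))]
best = pick_conjunct(candidates, visible, "z")
```

Here `best` is bar(y,z), matching the example in the text, since bar(v,z) would bring the new object v into the graph.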


Vere’s Thoth-pb induction method for relational
productions uses association chains to
selectively augment examples with background
information before generalization [Vere 77].
Rather than logical induction, the topic there is
finding operators that describe change in
discrete scenes, similar to string rewrite rules
and STRIPS operators. But the idea of selecting
descriptors along relational links between
objects and being conservative about expanding
the association chain of such links is related to
the approach taken here.

2.2  The  Incy  algorithm 

Incy begins by selecting an example of the
concept to be learned. For each object appearing
as an argument in the example, the set of known
facts about it is collected (the about set). One
candidate from each of these sets is selected to
form a preliminary premise for the hypothesis.
More than one of the objects appearing in the
example may occur in one candidate fact, so that
the resulting candidate set may be smaller than
the arity of the example. Incy keeps track of the
candidates used across the iterations of this
selection of preliminary conjuncts for one
example, so that it can avoid specializing
permutations of the same hypothesis.
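
The construction of about sets and preliminary premises can be sketched as follows. This is an illustrative sketch, not Incy's implementation: the helper names, the tuple encoding of facts and the use of a `seen` set to skip repeated selections are assumptions.

```python
from itertools import product


def about_sets(example_args, facts):
    """For each object in the example, collect the known facts mentioning it."""
    return {obj: [f for f in facts if obj in f[1]] for obj in example_args}


def preliminary_premises(example_args, facts):
    """Yield each distinct premise formed by picking one fact per about set."""
    sets = about_sets(example_args, facts)
    seen = set()
    for combo in product(*(sets[obj] for obj in example_args)):
        premise = frozenset(combo)   # a fact picked for two objects counts once
        if premise not in seen:      # skip permutations of an earlier selection
            seen.add(premise)
            yield premise


# Hypothetical facts for the example foobar(x, y, z).
facts = [("foo", ("x",)), ("foo", ("y",)),
         ("bar", ("y", "z")), ("bar", ("v", "z"))]
premises = list(preliminary_premises(("x", "y", "z"), facts))
```

When bar(y,z) is chosen for both y and z, the premise has only two conjuncts, which is the case the text describes as a candidate set smaller than the arity of the example.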

Linked-enough  analysis 

This premise is subjected to a “linked-enough”
analysis before actually testing the hypothesis.
In the course of this analysis, the premise may
be augmented by one or more conjuncts to ensure
that all arguments are sufficiently linked. An
argument is linked-enough if it is not free, i.e.
it appears in at least two literals of the
hypothesis. It is linked to the other arguments
of the literals it occurs in. Incy collects the
’problem’ arguments that do not pass the
linked-enough test and for each tries first to
find a new conjunct linking it to the head
arguments, and otherwise to other premise
arguments (here giving preference to the other
problem arguments). As a last resort, an argument
may be marked as a constant if no suitable
conjunct is found. Section 4 gives a formal
definition. The linked-enough test/modification,
in contrast to the specialization described
below, does not introduce new arguments into the
hypothesis. It can be interpreted as a sort of
consolidation phase after which all arguments are
sufficiently bound (or described, or linked):
either they appear repeatedly in the hypothesis,
or they are marked as constants.


Incy top level

while there are uncovered examples of the goal concept
    select an example
    construct about sets for the example’s args
    while there are combinations of candidates
        form prelim. conjunction by selecting one candidate fact from each about set
        perform linked-enough test/modifications on this conjunction of facts
        form & test hypo
        fail (backtrack) or specialize hypo
    end {hypo generation loop for one example}
    fail (backtrack to another example or end)
end {example selection loop}

Incy specialize hypo

while premise is not longer than c*arity(goal)
    construct about sets for all ’variables’ occurring in hypo
    select a new conjunct for the premise from the about sets
    perform linked-enough test/modifications
    form & test hypo
    fail (backtrack) or further specialize hypo
end {specialization}


Figure 2: Pseudocode for Incy’s data-driven phase
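
The test for ’problem’ arguments can be sketched directly from the definition above (an argument is free if it appears in fewer than two literals). The function name and the tuple encoding of literals are assumptions; the repair step that searches for linking conjuncts is not shown.

```python
from collections import Counter


def problem_arguments(literals, constants=frozenset()):
    """Arguments that occur in fewer than two literals of the hypothesis
    and have not been marked as constants."""
    occurrences = Counter()
    for _, args in literals:
        occurrences.update(set(args))   # count each literal once per argument
    return sorted(a for a, n in occurrences.items()
                  if n < 2 and a not in constants)


# Head first, then premise:  foo(x) & foo(y) -> foobar(x,y,z)
hypothesis = [("foobar", ("x", "y", "z")), ("foo", ("x",)), ("foo", ("y",))]
free_before = problem_arguments(hypothesis)

# Adding bar(y,z) links z without introducing a new argument.
free_after = problem_arguments(hypothesis + [("bar", ("y", "z"))])
```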

Hypothesis  formation  and  test 

The ’hypothesis’ thus reached is a conjunction
of facts; it is now turned into a true hypothesis
by systematically substituting variables for
constants. Testing procedures as described in
[Kietz/Wrobel 92] are applied to this hypothesis.
In this process, a rule model is abstracted from
the hypothesis (in analogy to abstracting a rule
from a conjunction of facts, here the predicate
names are turned into variables). The result of
the test can be one of:

    rule known
    rule model known
    hypothesis too specific
    hypothesis accepted
    hypothesis too general

In the first four cases, Incy backtracks. This is
straightforward in the hypothesis too specific
case, but why not specialize when rule known,
rule model known or hypothesis accepted? The
decision not to specialize here is what underlies
the built-in frivolity of Incy’s data-driven
phase and allows it to put forth a maximum of
structurally different hypotheses in relatively
short time, since it avoids producing
structurally equivalent hypotheses. In one
strategy of cooperation discussed (Section 3), a
model-based learning phase is initiated when
hypothesis accepted, so that such hypotheses are
systematically enumerated. Note that a rule model
is retained for possible later use regardless of
whether the corresponding hypothesis was
accepted.
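
The two abstraction steps mentioned above (constants to variables, then predicate names to predicate variables) can be sketched as follows. This is a simplified illustration: it ignores arguments marked as constants, and mapping each distinct predicate name to one predicate variable is an assumption of the sketch, not a statement about the rule models of [Kietz/Wrobel 92].

```python
def variabilize(head, premise):
    """Replace each distinct constant by a variable X1, X2, ... consistently."""
    names = {}

    def lift(literal):
        pred, args = literal
        return (pred, tuple(names.setdefault(a, f"X{len(names) + 1}") for a in args))

    return lift(head), [lift(lit) for lit in premise]


def abstract_model(head, premise):
    """Replace each distinct predicate name by a predicate variable P1, P2, ..."""
    names = {}

    def lift(literal):
        pred, args = literal
        return (names.setdefault(pred, f"P{len(names) + 1}"), args)

    return lift(head), [lift(lit) for lit in premise]


# A conjunction of facts about the hypothetical objects a, b, c.
head = ("foobar", ("a", "b", "c"))
premise = [("foo", ("a",)), ("foo", ("b",)), ("bar", ("b", "c"))]

rule = variabilize(head, premise)
model = abstract_model(*rule)
```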

If, in backtracking, there are no other ways of
ensuring the hypothesis is linked-enough, other
preliminary candidates are selected. If no good
combinations (as described at the beginning of
this section) are available, a new example is
selected. Only in the final test result case
(hypothesis too general) is the hypothesis
specialized.
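
The control decision that follows a test can be summarized in a few lines. The enum and function names below are assumptions chosen for the example; only the mapping from the five test results to the two actions is taken from the text.

```python
from enum import Enum, auto


class TestResult(Enum):
    RULE_KNOWN = auto()
    RULE_MODEL_KNOWN = auto()
    TOO_SPECIFIC = auto()
    ACCEPTED = auto()
    TOO_GENERAL = auto()


def next_step(result):
    """Specialize only when the hypothesis is too general; otherwise backtrack."""
    return "specialize" if result is TestResult.TOO_GENERAL else "backtrack"


actions = {r.name: next_step(r) for r in TestResult}
```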

Specialization 

Specialization is a modified version of the
linked-enough modification above: the new
conjunct should ideally link a head variable to
one of the premise variables. If no such fact is
available, the next preferred type is one that
links two premise variables (both may introduce a
new variable in addition to linking two existing
ones). If this, too, is impossible, a new
conjunct binding one of the non-head variables is
selected. The crucial difference from the
linked-enough modification is that here new
variables may be introduced into the premise. A
subsequent call to the linked-enough analysis
ensures that none of these remain free. From here
on, Incy proceeds in the same manner as above
(see Fig. 2). The depth bound for specialization
is a function of the goal concept’s arity,
c*arity(goal-concept); c can be modified via a
parameter.
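
The preference order for new conjuncts and the depth bound can be sketched as follows. This is one interpretation of the three preference classes, written for illustration; the function names, the numeric ranks and the default value `c=2` are assumptions.

```python
def preference(fact, head_vars, premise_vars):
    """Rank a candidate conjunct; lower is better.
    0: links a head variable to a (different) premise variable
    1: links two premise variables
    2: binds one non-head premise variable
    3: none of the above"""
    args = set(fact[1])
    heads = args & head_vars
    prems = args & premise_vars
    if any(h != p for h in heads for p in prems):
        return 0
    if len(prems) >= 2:
        return 1
    if prems - head_vars:
        return 2
    return 3


def may_specialize(premise, goal_arity, c=2):
    """Depth bound from Fig. 2: the premise may not be longer than c * arity(goal)."""
    return len(premise) <= c * goal_arity


head_vars = {"X", "Y", "Z"}
premise_vars = {"X", "W", "U"}
candidates = [("qux", ("W", "V")), ("baz", ("W", "U")), ("bar", ("Z", "W"))]
ranked = sorted(candidates, key=lambda f: preference(f, head_vars, premise_vars))
```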

3  Cooperation 

The two basic forms of cooperation between
data-driven and model-based methods have already
been touched on:

1. A rule model is abstracted from a hypothesis
and used to test hypotheses. This call structure
influences Incy’s behavior in several ways:

• After abstracting a hypothesis to a rule
model, the new model is compared to existing
models by an extension of theta-subsumption
[Kietz/Wrobel 92]. Redundant, i.e. structurally
equivalent, models are caught at this point, and
this causes Incy to backtrack (search for a
different specialization or select different
conjuncts for the preliminary hypothesis).

• If the data-driven phase should discover a
rule that is already known, this is detected
during the test and specialization is aborted.

• The criteria used to decide if a hypothesis is
acceptable, too specific or too general are
parametrized. Changing their values causes Incy
to accept more or less general and more or less
bold hypotheses. Since Incy’s decisions about
when to specialize, when to select different
conjuncts about the current example’s arguments,
and when to try a different example depend on the
outcome of this test, changing their values
significantly changes Incy’s behavior.

2. The rule models generated during Incy’s pass
over data can be used in a subsequent, more
stringent analysis. During this model-based pass,
the models are instantiated with fitting
predicates from the knowledge base
[Kietz/Wrobel 92]. This may result in two
different types of rules:

• Rules structurally equivalent to one of those
discovered during the data-driven phase: each of
those initial rules can be understood as being an
example for a set of rules. Naturally, such
structurally equivalent rules will, in general,
be quite different in “meaning”.

• Rules not structurally equivalent to any of the
rules discovered by Incy, but based on one of
Incy’s rule models nevertheless. Incy misses many
acceptable rules because its pass is guided not
only by the test results that pertain to a
specific hypothesis, but also by the
(pre-)existence of rule models. This is the
reason for Incy’s built-in frivolity, but also
for its speed. Incy may discard a hypothesis
whose corresponding rule model, with different
instantiations of the predicate variables, yields
plausible rules. These rules are found in the
model-based pass.

3.1 Other Schemes of Cooperation

Sequential cooperation

The basic cooperation scheme discussed above is
dynamic in that Incy makes model generation and
hypothesis testing calls, but learning itself is
strictly sequential: Incy preprocesses data,
discovering some rules and more rule models, and
a model-guided induction step can be called
subsequently. In the knowledge engineering
context provided by Mobal, this is fitting, as it
allows the use of Incy for a quick initial
analysis of data and RDT for a more concerted
later effort.

A more selective use of the rule models produced
by Incy is also possible, however. After Incy has
completed its pass, for instance, a domain expert
may scan the rules found and select those that
seem most promising; RDT can be made to search
only for similar rules by passing on only those
models which the promising rules are instances
of. Alternatively, the expert may inspect the
rule models found by Incy and pass only those
deemed plausible on to RDT.

Dynamic cooperation

In the same vein, Incy may make selective calls
to a model-based pass from within its data-driven
pass over data. For instance, before backtracking
or specializing when a hypothesis is accepted, a
model-based pass may be initiated to learn with
the corresponding rule model only; or such a pass
may be initiated with the list of models that
were successful in this sense after one of the
data-driven pass’s inner loops has terminated. A
third possibility is to call model-based methods
with the most successful branch of the model
subsumption tree: the set of increasingly special
(by theta-subsumption) models corresponding to
the highest number of rules discovered by Incy.

The first of these dynamic cooperation schemes
has been implemented. Note, however, that this
significantly alters Incy’s behavior. The
sequential version described in Section 2.2 is
designed to quickly discover a maximum of
well-structured rules, each of which can be
viewed as an example for a class of structurally
equivalent rules. On the other hand, the dynamic
combination of model-based and data-driven
strategies will test the entire classes of rules
immediately, and take accordingly longer to
complete a pass. The sequential version will be
of more use in a knowledge engineering context
where a quick preliminary analysis of large
amounts of data is desirable. In the dynamic
version, Incy functions more as a model generator
for an RDT-like method, producing the models it
needs on the fly, based on the connectivity of
the knowledge base at hand.

4  Results 

The hypothesis language defined procedurally by
Incy can be described as:

$$\mathcal{L}_{hs} = \{\, L_{prems} \rightarrow l_{concl} \mid \exists\sigma : \{l_{concl} \cup L_{prems}\}\sigma \subseteq B_g \;\land\; \mathit{linked\text{-}enough}(\{l_{concl}, L_{prems}\}) \;\land\; |L_{prems}| \le \mathit{depth\text{-}bound}(l_{concl}) \,\}$$

where $l_{concl}$ is the conclusion literal (the
head), $L_{prems}$ is the set of premise literals,
and $B_g$ is the set of facts that make up the
knowledge base^4. $\sigma$ is a substitution of
terms for variables such that two different
variables are replaced by different terms:

$$\sigma \in \{\, \{v_i/t_i\} \mid \forall\, 1 \le i < j \le n : v_i \ne v_j \rightarrow t_i \ne t_j \;\land\; t_i, t_j \text{ are ground} \,\}$$

^4 Note that instances of the learning goal are
also in $B_g$, i.e. Incy may learn recursive rules.

The space of such hypotheses is searched top-down.
Note that INCY's data-driven phase
discovers only a subset $\mathcal{R}_{frivolous} \subset \mathcal{R}_{valid} \subset
\mathcal{L}_{ns}$ of all such rules valid in a given domain.
The model-based phase continues this work,
but does not find all of $\mathcal{R}_{valid}$ because
specialization is aborted when a rule model is already
known (recall Section 2.2). The next restriction
imposed on the base language (function-free
Horn clauses with negation) is that the
premise together with the goal of a hypothesis
must be linked-enough (Section 2.2):

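$$\mathit{linked\text{-}enough}(\{l_{concl}, L_{prems}\}) \Leftrightarrow
\forall v \in V_{l_{concl}}\ \exists v_{prem} \in V_{L_{prems}} : v = v_{prem}
\;\wedge\; \forall v \in V_{L_{prems}}\ \exists v_{concl} \in V_{l_{concl}} : \mathit{linked}(v, v_{concl})
\;\wedge\; \forall v \in V_{L_{prems}} \setminus V_{l_{concl}}\ \exists l_i, l_j \in L_{prems} : (l_i \neq l_j \wedge v \in V_{l_i} \wedge v \in V_{l_j})$$

where $V_{l_{concl}}$ are the head variables, and
$V_{L_{prems}} \setminus V_{l_{concl}}$ are the variables occurring only in the
premise. The linked relation between variables
of a hypothesis is best defined recursively: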

$$\mathit{linked}(v_1, v_2) \Leftrightarrow \exists l \in \{l_{concl} \cup L_{prems}\} : v_1, v_2 \in V_l \;\vee\; \exists v_3 : \mathit{linked}(v_1, v_3) \wedge \mathit{linked}(v_2, v_3)$$

This  formalization  defines  a  superset  of  the 
hypotheses  that  pass  the  linked-enough  test 
in  the  top-level  loop  (Fig.  2):  the  modifi¬ 
cations  made  during  linked-enough  analysis 
there  do  not  introduce  new  variables.  These 
are  introduced  during  specialization,  so  that 
the definition above is precise with respect to INCY's
overall  behavior. 
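
The recursive definition of the linked relation amounts to taking connected components over variables, where two variables are directly connected if they occur in the same literal. The following is a minimal Python sketch of that reading, not INCY's implementation; the encoding of each literal as a tuple of its variable names and the function names are assumptions made for illustration.

```python
def linked_classes(literal_vars):
    """Partition variables into classes of mutually linked variables.

    literal_vars: iterable of tuples, one per literal (head or premise),
    each holding the variable names occurring in that literal
    (a hypothetical encoding chosen for this sketch).
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for variables in literal_vars:
        variables = list(variables)
        for v in variables:
            find(v)  # register every variable
        for v in variables[1:]:
            # Variables sharing a literal fall into the same class.
            parent[find(v)] = find(variables[0])

    classes = {}
    for v in list(parent):
        classes.setdefault(find(v), set()).add(v)
    return list(classes.values())


def linked(v1, v2, literal_vars):
    """True if v1 and v2 are linked through a chain of shared literals."""
    return any(v1 in c and v2 in c for c in linked_classes(literal_vars))


# grandparent(X, Z) <- parent(X, Y), parent(Y, Z), likes(U, W)
hypothesis = [("X", "Z"), ("X", "Y"), ("Y", "Z"), ("U", "W")]
print(linked("X", "Y", hypothesis))
print(linked("X", "U", hypothesis))
```

In this example the premise variables U and W are not linked to any head variable, which is the kind of hypothesis the linked-enough restriction is meant to exclude.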

The main effect of this is that hypotheses
with free variables are not put forth by INCY.
Note that variables for which this condition
cannot be fulfilled are turned into constants
during linked-enough analysis. On the whole,
this is a weaker restriction than that of
ij-determinacy used in GOLEM, since determinacy
is not supposed, and there is no explicit
depth limit (i) for the variables occurring in
a hypothesis, so that INCY is able to learn
in domains where GOLEM isn't. FOIL's
information gain heuristic ensures that only
linked conjuncts are added to the premise
during specialization, but accepts hypotheses
with free variables, so that its implicitly
defined language cannot be compared directly
to $\mathcal{L}_{ns}$. Some tests indicate that FOIL's
heuristic does not do well with sparse data
because there is little information gain to
work with. In the presence of a few negative
examples, FOIL can no longer apply
the closed-world assumption and finds
non-generative rules which do not cover any of the
examples*. INCY does not make the closed-world
assumption, so that its results are not
affected.

The main difference between FOIL and INCY,
as far as results are concerned, is that the
rules discovered are quite different structurally,
and that INCY discovers far more
rules. This may or may not be deemed an
advantage, as INCY's rules tend to be more
redundant than FOIL's (several rules cover an
example). Among these there are often more
sensible ones - from a knowledge engineering
point of view - than those FOIL homes in on.
When using INCY as a model generator for
induction algorithms using declarative bias,
such as RDT, GRENDEL and CLINT, diversity
in models is a plus. INCY learns
recursive rules and rules with constants, and
hence models which reflect this. FOIL was
more affected by sparsity of examples, where
information gain has little to work with, and
rules induced using INCY's models scaled up
better when new examples were incorporated
into the knowledge base, but experiments in
other domains are underway to corroborate
this.

* In these rules, not all variables were bound in the
premise. Mobal's inference engine does not apply
such faulty rules.


Abstract

This  paper  presents  a  method  for  multistrategy 
constructive induction that integrates two
inferential  learning  strategies — empirical 
induction  and  deduction,  and  two 
computational  methods — data-driven  and 
hypothesis-driven.  The  method  generates 
inductive  hypotheses  in  an  iteratively  modified 
representation  space.  The  operators  modifying 
the representation space are classified into
"constructors," which expand the space (by
generating additional attributes) and
"destructors," which contract the space (by
removing  low  relevance  attributes  or 
abstracting  attribute  values).  Constructors 
generate  new  dimensions  (attributes)  by 
analyzing  original  or  transformed  examples 
(data-driven)  and  by  analyzing  the  rules 
obtained  in  the  previous  iteration  (hypothesis- 
driven).  Destructors  detect  the  irrelevant 
components of the representation space by
rule-based inference or statistical analysis. The
method  has  been  implemented  in  the  AQ17- 
MCI  program.  The  preliminary  results  from 
applying it to a problem with noisy training
data and a large number of irrelevant attributes
demonstrated the superiority of the method over
other constructive induction methods in
terms of both predictive accuracy and the
overall simplicity of the general descriptions.

Key  words:  multistrategy  learning, 
inductive  inference,  constructive  induction, 
representation  space,  concept  learning. 


1.  Introduction 

Conventional  concept  learning  techniques 
generate  hypotheses  in  the  same  representation 
space  in  which  original  training  examples  are 
presented.  In  many  learning  problems, 
however,  the  original  representation  space  is 
inadequate for formulating the correct
hypothesis. This inadequacy can be evidenced
by a high degree of irregularity in the
distribution of instances of the same class in the
original representation space.

In such a situation, there exists a mismatch between
the  complexity  of  concept  boundaries  in  the 
space  and  the  capabilities  of  the  descriptive 
constructs  of  the  representation  language  to 
describe  the  boundaries.  Consequently,  if  the 
boundaries  are  highly  irregular,  typical 
constructs  used  in  learning  systems  will  likely 
be  inadequate  for  representing  them.  Such 
typical  constructs  include  nested  axis-parallel 
hyper-rectangles (decision trees), arbitrary
axis-parallel hyper-rectangles (conjunctive
rules with internal disjunction, as used in
VL1), hyperplanes or higher degree surfaces
(neural nets), compositions of elementary
structures (grammars), etc.

To  address  such  problems,  the  idea  of 
constructive  induction  has  been  introduced 
(Michalski, 1978; Watanabe and Elio, 1987;
Matheus and Rendell, 1989; Rendell and
Seshu, 1990; Wnek and Michalski, 1991).
Constructive  induction  can  be  viewed  as  a 
"double-search"  process,  that  searches  both 
for  a  hypothesis  and  for  an  adequate 
representation  space  in  which  to  express  this 
hypothesis. 

Most  constructive  induction  methods  use  a 
specific  technique  within  one  basic 
computational method. Basic methods are
classified as data-driven, hypothesis-driven and
knowledge-driven (Wnek and Michalski, 1991,
1993). Recently, there has been a trend toward
"multistrategy"  constructive  induction 
approaches  that  integrate  several  techniques  and 
methods. 

This  paper  presents  early  results  on  the 
development  of  a  multistrategy  constructive 
induction  system,  AQ17-MCI,  that  aims  at 
integrating  a  wide  range  of  constructive 
induction  techniques  and  methods.  The  basic 
ideas  and  the  architecture  of  the  system  are 
based on the Inferential Theory of Learning
(ITL), proposed by Michalski (1992). In ITL,
learning is viewed as a "goal-directed process
of modifying the learner's knowledge by
exploring the learner's experience."

As  mentioned  above,  a  constructive  induction 
learner  performs  two  types  of  searches — a 
search  for  an  inductive  hypothesis  and  a  search 
for  an  adequate  representation  space  in  which 
the  hypothesis  is  represented.  These  two  types 
of  searches  require  different  types  of  search 
operators. 

The  search  for  a  hypothesis  applies  operators 
provided  by  the  given  inductive  learning 
method.  For  example,  the  AQ17-MCI  method 
(briefly,  MCI)  uses  operators  employed  in  the 
AQ-type learning systems, such as "dropping
conditions," "extension against," "adding an
alternative," "closing interval," and "climbing
a generalization tree."

The  representation  space  search  operators 
modify the representation space. The AQ17-MCI
method uses both "constructors," which
expand the space by adding new dimensions
(attributes), and "destructors," which contract the
space by removing less relevant attributes
and/or abstracting values of some attributes.

To  perform  a  representation  space  search, 
meta-operators  are  introduced  that  allow  the 
system  to  suggest  different  representation  space 
search  operators  and  methods  ("constructive 
induction  strategies").  Using  the  ITL 
framework,  the  selection  of  constructive 
induction  strategies  is  done  by  applying  the 
operator  selection  rules  based  on  the  evaluation 
of hypotheses generated in consecutive
iterations, i.e., by "exploring the learner's
experience."

This  paper  describes  several  techniques  and 
methods  for  representation  space  search,  their 
integration in the AQ17-MCI system, and the
results  from  testing  the  system  and  comparing 
it  with  several  other  systems.  The  hypothesis 
space  search  is  assumed  to  be  done  by  the 
standard  AQ-type  algorithm. 

2.  Related  Research 

The  MCI  method  is  relevant  to  both  the 
research  in  constructive  induction  and 
multistrategy  learning.  Related  work  includes 
the  system  LAIR  (Watanabe  and  Elio,  1987), 
and  "Principled  Constructive  Induction" 
(Mehra,  Rendell  and  Benjamin,  1989).  LAIR 
uses  domain-specific  background  knowledge  to 
construct new attributes. Principled
constructive induction uses geometric
interpretations of various constructors to guide
their  selection.  Neither  of  these  approaches, 
however,  possesses  the  wide  range  of 
constructors  and  destructors  available  in  MCI. 

Other related systems are STAGGER, which
integrates techniques for Boolean, numerical
and weight learning (Schlimmer, 1987),
GABIL, which performs adaptive strategy
selection based on classification performance (Spears and
Gordon, 1991), and MBAC, which uses
parabolic models of strategy performance for
strategy selection (Holder, 1991). The strategy
selection  in  MCI  also  draws  inspiration  from 
research  on  the  development  of  large  scale 
inference  systems,  especially  INLEN 
(Kaufman,  Michalski  and  Kerschberg,  1991). 
In  INLEN,  knowledge  acquisition  or  discovery 
is  based  on  learning  rules  from  expert-supplied 
examples.  The  automated  acquisition  of  rules  is 
most  suitable  to  areas  where  expertise  is 
difficult  to  quantify,  or  where  rules  may  need 
to  be  modified  often,  such  as  in  the  case  of 
strategy  selection  for  constructive  induction. 

Several  systems  have  been  developed  that 
exhibit  constructive  induction  capabilities. 
Some of the earliest were INDUCE (Michalski,
1980) and LEX (Mitchell, Utgoff and Banerji,
1983). Many systems are based either on an
analysis of the training data, i.e., data-driven
systems (e.g., Schlimmer, 1987; Bloedorn and
Michalski, 1991), or on an analysis of
hypotheses, i.e., hypothesis-driven systems
(Matheus and Rendell, 1989; Pagallo and Haussler,
1990; Wnek and Michalski, 1991, 1993).

These  techniques  are  not  very  useful  in 
situations  requiring  different  types  of 
knowledge  representation  space  change.  For 
example,  learning  from  complex  and  noisy 
sensory data (e.g., learning to recognize
textures or shapes) seems to require a number
of different techniques for the representation
space change.

Given  several  such  techniques,  a  problem 
arises of choosing the one that best fits a
given situation. This problem is somewhat
analogous to the problem of choosing an
inductive learning method to fit the
problem at hand. Aha (1992) has proposed to
solve the latter problem by using meta-rules
that link the properties of training datasets with
various empirical inductive learning methods.

To  choose  among  many  representation  space 
modification  operators,  the  MCI  method  uses 
meta-rules  that  link  the  properties  of  the 
training  datasets  and  properties  of  the 
hypotheses  generated  from  these  datasets  with 
the  appropriate  representation  space 
modification  operators. 

3.  The  MCI  Method 

3.1  An  Overview 

The  MCI  method  integrates  a  large  number  of 
different  representation  space  modification 
techniques  that  are  used  to  determine  an 
adequate  representation  space  for  concept 
learning. The process of concept learning itself
is done by an AQ-type inductive learning
method.

A  general  flow  diagram  for  the  MCI  method  is 
shown  in  Figure  1.  The  input  data  are  initially  a 
user-provided  training  dataset  plus  a 
characterization  of  the  initial  representation 
space,  which  includes  a  description  of 
attributes,  their  types  and  their  domains.  The 
training  dataset  is  split  into  a  primary  and  a 
secondary dataset. The primary training set is
input to the Decision Rule Generation
module, which uses an empirical inductive
learning program (AQ14) to generate general
concept descriptions (rulesets). The obtained
rulesets are evaluated in terms of their
complexity and their performance on the
secondary training set. Based on the results of
this evaluation, the system decides either to
stop the learning process (the obtained rules are
output as the solution), or to move to the
Representation Space Modification module.
This decision is based on special control
meta-rules (Section 3.2). The final decision rules
are evaluated on the testing examples to
determine their performance. Figure 2 shows
the partitioning of the input examples into
different classes (primary and secondary
training examples, and testing examples), and
explains how they are used.

Figure 1. A functional diagram of the MCI method.

The  representation  space  modification  is  done 
by  an  application  of  various  constructive 
induction  operators,  acting  as  constructors  or 
destructors.  Once  a  new  representation  space 
has been determined, both the primary and
secondary training datasets are reformulated into
this space, and the process is repeated. The
next sections describe in greater detail various
aspects of the above process.
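
The control flow described in this overview can be sketched as a simple loop. This is only an illustration under stated assumptions, not the AQ17-MCI implementation: the callables `learn_rules`, `evaluate`, `is_satisfactory` and `select_operator`, as well as the `max_iterations` cap, are hypothetical stand-ins for the Decision Rule Generation module, the ruleset evaluation, the control meta-rules and the Representation Space Modification module.

```python
def mci_loop(primary, secondary, learn_rules, evaluate,
             is_satisfactory, select_operator, max_iterations=10):
    """Alternate rule learning and representation space modification."""
    ruleset, quality = None, None
    for _ in range(max_iterations):
        ruleset = learn_rules(primary)          # learn on the primary set
        quality = evaluate(ruleset, secondary)  # judge on the secondary set
        if is_satisfactory(quality):
            break
        operator = select_operator(primary, secondary, quality)
        if operator is None:                    # no applicable operator
            break
        # Reformulate both training sets in the new representation space.
        primary, secondary = operator(primary), operator(secondary)
    return ruleset, quality


# Toy stand-ins: a "ruleset" is just the number of attributes seen, and the
# quality becomes satisfactory once a third attribute has been constructed.
def add_parity(rows):
    return [row + [sum(row) % 2] for row in rows]


result = mci_loop(
    primary=[[0, 1], [1, 1]],
    secondary=[[1, 0], [0, 0]],
    learn_rules=lambda rows: len(rows[0]),
    evaluate=lambda ruleset, rows: 1.0 if ruleset >= 3 else 0.5,
    is_satisfactory=lambda quality: quality >= 0.9,
    select_operator=lambda p, s, q: add_parity,
)
print(result)
```

The final evaluation on the testing examples is deliberately left outside the loop, since the testing examples play no role in deciding when to stop.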

3.2  Determining  When  to  Modify  the 
Representation  Space 

The  representation  space  needs  to  be  modified 
if  there  exists  a  mismatch  between  the 
distribution  of  examples  in  the  space  and  the 
capability  of  the  representation  language  to 
adequately  describe  this  distribution.  This 
mismatch  can  be  removed  either  by  developing 
a  learning  algorithm  capable  of  generating  more 
complex  discrimination  surfaces  in  the  given 
representation  space,  or  by  changing  the 
representation so that simple discrimination
surfaces  will  do  the  job.  For  some  problems, 
the  first  approach  is  infeasible. 

The  constructive  induction  approach  is  to 
modify  the  representation  space  to  remove  the 
mismatch.  An  illustrative  example  of  such  a 
mismatch  is  the  "bit  parity  detection"  problem. 
A  description  of  binary  strings  with  this 
property  in  terms  of  the  bit  positions  in  the 
string is very complicated and long. If,
however, one generates an additional attribute
(a dimension in the representation space)
"mod 2" of the sum of the bits in the string, the
problem becomes trivial. The MCI approach is
to apply a wide range of such operators for
representation space change in order to
determine a description space in which it would
be easy to find the correct or approximately
correct decision rules.

Figure 2. Subsets of the examples and their role.
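
The parity example can be made concrete with a short Python sketch. It assumes 4-bit strings and takes the target class to be "odd parity"; both choices, and the helper names, are illustrative assumptions rather than details from the MCI system.

```python
from itertools import product


def add_parity_attribute(bits):
    """Constructor: append (sum of the bits) mod 2 as a new attribute."""
    return tuple(bits) + (sum(bits) % 2,)


def determined_by(examples, index):
    """True if the attribute at `index` alone fixes the class of every example."""
    seen = {}
    for attributes, label in examples:
        if seen.setdefault(attributes[index], label) != label:
            return False
    return True


# All 4-bit strings; the class is assumed to be "odd parity".
original = [(bits, sum(bits) % 2 == 1) for bits in product((0, 1), repeat=4)]
reformulated = [(add_parity_attribute(bits), label) for bits, label in original]

# No single original bit determines the class ...
print([determined_by(original, i) for i in range(4)])
# ... but the constructed attribute does.
print(determined_by(reformulated, 4))
```

In the original space a correct description must mention every bit position, whereas in the expanded space a single condition on the new attribute suffices.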

The  problem  arises  of  how  to  detect  the  need 
for  representation  space  change.  The  MCI 
method  solves  this  problem  on  the  basis  of  the 
"quality"  of  descriptions  (rulesets)  generated 
by  the  Decision  Rule  Generation  Module.  The 
"quality"  of  the  obtained  ruleset  is  evaluated  in 
terms  of  its  predictive  accuracy  on  the 
secondary  training  set  and  its  complexity.  If  the 
quality  is  "satisfactory",  according  to  the  user 
or  some  heuristic  criterion,  then  the  process 
stops.  A  description  of  the  dataset  of  examples 
in  terms  of  certain  meta-attributes  is  stored  in 
the system's knowledge base to serve as a
"meta-training example."

Meta-examples are used to represent both datasets
that require some kind of representation
space modification and those that do not. These
meta-examples are used to develop meta-rules
guiding the decisions about the need for the
representation space change. If the rule quality
is unsatisfactory, the method enters the
Representation Space Modification module.

3.3  Determining  How  to  Modify  the 
Representation  Space 

3.3.1  Meta-attributes  and  Meta-rules 

The  representation  space  is  modified  by 
applying  a  variety  of  operators.  These 
operators  include  both  constructors  that  expand 
the  space  and  destructors  that  contract  the  space 
(see  Section  3.3.3).  The  choice  of  the 
operators  is  guided  by  the  meta-rules  that  relate 
the  properties  of  the  example  dataset  and  the 
rule  evaluation  results  on  the  secondary  training 
set  to  the  most  appropriate  operators.  These 
rules  are  initially  provided  by  the  user,  and  later 
improved through learning from the
meta-examples mentioned below.

The meta-examples are described in terms of
meta-attributes. These meta-attributes are
organized into four classes: those characterizing
types of the original attributes (numeric,
multivalued nominal, Boolean, etc.), those
characterizing the attribute quality, such as the
attribute utility (Imam and Michalski, 1993) or
the entropy measure (Quinlan, 1983), those
characterizing the expected level of quality of
the examples, and those characterizing the
changes in the performance of the generated
rules on the secondary training dataset. Table 1
presents a list of meta-attributes. With the
exception of Irrelevant_attributes_present and
Attribute_group_quality, which can be
automatically calculated in a manner described
below, the values of these meta-attributes are
provided by the user.

Meta-attribute | Values | Explanation

Meta-attributes detecting the presence of various types of attributes
Numeric_attributes_present | Yes, No | Yes, if data contains two or more numeric attributes; No, otherwise
Nominal_attributes_present | Yes, No | Yes, if data contains two or more multi-valued nominal attributes; No, otherwise
Boolean_attributes_present | Yes, No | Yes, if data contains two or more Boolean attributes; No, otherwise

Meta-attributes characterizing the attribute quality
Irrelevant_attributes_present | Yes, No | Yes, if data contains any irrelevant attributes; No, otherwise
Attribute_group_quality | Sufficient, Insufficient | Sufficient, if the minimum quality of the set of attributes is above an assumed threshold; Insufficient, otherwise

Meta-attributes estimating the quality of examples
Overprecision | Yes, No | Yes, if an attribute in the given set is measured with an excessive precision; No, otherwise
Attribute_value_noise_level | Error rate in percentage (1..100%) | Teacher-estimated error rate in the measurement of the attribute values in the examples
Classification_noise_level | Error rate in percentage (1..100%) | Teacher-estimated error rate in the assignment of examples to classes by the teacher ("mislabeling")

Meta-attributes estimating ruleset performance
Performance_estimation | Accuracy in percentage (1..100%) | Predictive accuracy of the last ruleset generated from the primary training example set and tested on the secondary training set
Performance_change | Strongly Up, Up, No change, Down, Strongly Down | Measures the difference in performance between the n-th ruleset learned and the (n-1)-st ruleset learned

Table 1. Meta-attributes for characterizing datasets.

a)  Attribute  Type 

The  applicability  of  the  representation  space 
modification operators (for short, RSM
operators) depends on the type of the attributes.
For example, arithmetic operators apply to
numeric attributes, logical operators apply to
Boolean and multi-valued nominal attributes,
etc. The types of attributes for which different
RSM operators are available are currently
numeric, multi-valued nominal and Boolean.

b)  Attribute  Quality 

Attribute  quality  measures  the  ability  of  a  single 
attribute  to  discriminate  among  given  classes  of 
examples.  An  attribute  may  contribute 
individually,  or  as  part  of  an  attribute  group. 
Individual  attribute  quality  can  be  measured 
statistically  by  calculating  the  ability  of  an 
attribute  to  partition  the  example  set 
appropriately.  One  such  measure  is  the 
information gain used in ID3 (Quinlan, 1983).

The  value  of  the  meta-attribute 
"Attribute_group_quality"  is  "True"  if  each 
attribute  in  the  given  group  of  attributes  has 
gain  greater  than  a  user-defined  minimum.  This 
meta-attribute  is  useful  for  detecting  situations 
in  which  each  original  attribute  has  some 
relevance,  but  not  very  high,  which  may  be 
suggestive of the need for some multi-argument
representation space operator (e.g., a mod x of
the sum of the values of the attributes).
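
As an illustration of these two notions, the sketch below computes an ID3-style information gain for a single attribute and then checks whether every attribute in a group exceeds a minimum gain. The dictionary encoding of examples, the `class_key` parameter and the `min_gain` value are assumptions made for this example; the text does not specify how AQ17-MCI represents them.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, class_key="class"):
    """Reduction in class entropy obtained by partitioning on `attribute`."""
    labels = [ex[class_key] for ex in examples]
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[class_key] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def attribute_group_quality(examples, attributes, min_gain, class_key="class"):
    """True if each attribute in the group has gain above `min_gain`."""
    return all(information_gain(examples, a, class_key) > min_gain
               for a in attributes)


examples = [
    {"a": 0, "b": 0, "class": "pos"},
    {"a": 0, "b": 1, "class": "pos"},
    {"a": 1, "b": 0, "class": "neg"},
    {"a": 1, "b": 1, "class": "neg"},
]
print(information_gain(examples, "a"), information_gain(examples, "b"))
print(attribute_group_quality(examples, ["a", "b"], min_gain=0.5))
```

Here attribute `a` separates the classes perfectly while `b` carries no information, so the group as a whole fails the test.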

The contribution of an individual attribute in the
context of a set of attributes can be measured
by analyzing rules generated from examples
described in terms of these attributes (Wnek
and Michalski, 1993). The meta-attribute
"Irrelevant_attributes_present" views an
attribute as irrelevant if this attribute is not
present in the rules, or is present only in the
"light" rules (rules associated with low values
of the t-weight parameter, i.e., the coverage of
training examples by a rule).

An alternative measure of the individual
attribute quality is the attribute utility (Imam
and Michalski, 1993). An attribute utility is the
sum of the class utilities of an attribute. The
class utility of an attribute is the number of
classes whose attribute value set has no
common values with the value set occurring in
the given class. An attribute is considered
irrelevant if its attribute utility is low. Thus,
Irrelevant_attributes_present is true if, for any
attribute present in the data, the utility of that
attribute is below a threshold.
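
The definition of attribute utility given above translates directly into code. The following is a minimal sketch based only on that verbal definition; the example encoding, the function names and the `threshold` value are assumptions, and the original measure of Imam and Michalski (1993) may differ in detail.

```python
def attribute_utility(examples, attribute, class_key="class"):
    """Sum over classes of the number of other classes whose value set
    for `attribute` shares no value with that class."""
    values_by_class = {}
    for ex in examples:
        values_by_class.setdefault(ex[class_key], set()).add(ex[attribute])
    utility = 0
    for cls, values in values_by_class.items():
        # Class utility: how many other classes are fully separated from cls.
        utility += sum(1 for other, other_values in values_by_class.items()
                       if other != cls and values.isdisjoint(other_values))
    return utility


def irrelevant_attributes_present(examples, attributes, threshold,
                                  class_key="class"):
    """True if any attribute has utility below `threshold`."""
    return any(attribute_utility(examples, a, class_key) < threshold
               for a in attributes)


examples = [
    {"color": "red", "size": "small", "class": "a"},
    {"color": "red", "size": "large", "class": "a"},
    {"color": "blue", "size": "small", "class": "b"},
    {"color": "blue", "size": "large", "class": "b"},
    {"color": "red", "size": "small", "class": "c"},
    {"color": "green", "size": "large", "class": "c"},
]
print(attribute_utility(examples, "color"), attribute_utility(examples, "size"))
print(irrelevant_attributes_present(examples, ["color", "size"], threshold=1))
```

In this toy dataset `size` takes the same values in every class and therefore separates nothing, while `color` separates class b from both other classes.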

c)  Example  Quality 

The  quality  of  training  examples  is 
characterized in terms of three meta-attributes.
The  first  one,  "Overprecision,"  tests  if  a  given 
attribute  is  measured  with  an  excessive 
precision.  In  such  a  situation,  the  valueset  of 
the  attribute  is  reduced,  and  the  values  of  this 
attribute  in  the  examples  are  substituted  by 
more abstract values. The second
meta-attribute, "Attribute_value_noise_level,"
expresses a teacher-estimated error rate in the
measurement of the attribute values in the
examples. The third meta-attribute,
"Classification_noise_level," expresses a
teacher-estimated error rate in the assignment of
examples to classes by the teacher
("mislabeling").

Figure 4. A hierarchy of constructive induction operators.

Overprecision  is  reduced  by  proper 
quantization  of  the  attributes  (e.g.,  Kerber, 
1992).  Noise  in  the  data  is  reduced  by  filtering 
training data through "heavy" rules (with high
t-weight) in the induced descriptions
(Pachowicz, Bala and Zhang, 1992).

d) Rule Performance

There are two meta-attributes in this category:
"Performance_level," which measures the
predictive accuracy of rules on secondary
training examples, and "Performance_change,"
which expresses the change in performance from
one rule generation iteration to the next. These
meta-attributes  help  guide  the  selection  of 
representation  space  modifiers  by  detecting 
when  successive  iterations  are  making 
significant  positive  increases  in  rule  quality,  or 
when  the  change  in  quality  has  declined  or 
ceased. 

3.3.2  Applying  Meta-rules  for  Operator 
Selection 

Each  example  dataset  is  characterized  by  a 
vector of the previously listed meta-attributes
and their values. Operator selection is a
deductive  process  of  applying  previously 
learned  representation  space  modification 
operator  rules  to  these  meta-attribute  vectors. 
This  matching  procedure  calculates  a  degree  of 
match  between  the  meta-example  and  the  RSM 
rules  using  ATEST  (Reinke,  1984). 
Representation  space  modifiers  are  then  ranked 
in decreasing order of match. If no single RSM
rule  is  the  top  rule,  then  the  user  is  asked  to 
select.  This  selection  may  be  based  on  the 
user’s  preference  for  different  types  of 
modifications  such  as  arithmetic  constructions 
over  logical  constructions. 

It  may  occur  that  the  same  RSM  operator  is 
repeatedly  selected.  In  other  words  the  search 
stagnates  on  a  local  maximum.  MCI  attempts  to 
prevent  this  by  updating  the  database 
characterization  after  each  ruleset  evaluation. 
Since  the  meta-attributes  are  updated 
continuously,  the  selection  stage  picks  the 
operator  that  best  matches  the  current  database 
characterization.  If  all  available  operators  fail  to 
match  the  description  (i.e.,  the  degree  of  match 
is  below  a  threshold)  then  selection  stops  and 
MCI  evaluates  the  current  ruleset  on  the  testing 
examples. At minimum, the performance of
MCI will be that which is achieved when no
modifications are made to the representation
space. In this case the performance of MCI will
be equal to that of selective induction alone.
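
The selection step can be sketched as follows. Note the assumptions: the degree of match used here is simply the fraction of a rule's conditions satisfied by the meta-example, which is a stand-in and not the ATEST procedure of Reinke (1984); the dictionary encoding and the `threshold` value are likewise hypothetical. The three rules are taken from Table 2.

```python
META_RULES = {
    "dci_numeric": {"Numeric_attributes_present": "Yes",
                    "Attribute_value_noise_level": 0},
    "dci_boolean": {"Numeric_attributes_present": "No",
                    "Nominal_attributes_present": "No",
                    "Irrelevant_attributes_present": "No"},
    "hci_rule_grouping": {"Attribute_value_noise_level": 0,
                          "Irrelevant_attributes_present": "Yes"},
}


def degree_of_match(conditions, meta_example):
    """Fraction of rule conditions satisfied (an assumed, simplified measure)."""
    satisfied = sum(1 for attribute, value in conditions.items()
                    if meta_example.get(attribute) == value)
    return satisfied / len(conditions)


def rank_operators(meta_rules, meta_example, threshold):
    """Operators whose match reaches `threshold`, best match first."""
    scored = [(degree_of_match(conditions, meta_example), name)
              for name, conditions in meta_rules.items()]
    matching = [pair for pair in scored if pair[0] >= threshold]
    return sorted(matching, key=lambda pair: pair[0], reverse=True)


meta_example = {
    "Numeric_attributes_present": "Yes",
    "Nominal_attributes_present": "No",
    "Irrelevant_attributes_present": "Yes",
    "Attribute_value_noise_level": 0,
}
ranked = rank_operators(META_RULES, meta_example, threshold=0.5)
top = [name for score, name in ranked if score == ranked[0][0]]
print(ranked)
print("ask the user to choose" if len(top) > 1 else top[0])
```

An empty ranking corresponds to the case where all operators fall below the threshold and selection stops; a tie at the top corresponds to the case where the user is asked to select.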

The set of constructive induction operators can
be organized hierarchically as shown in Figure
4. This hierarchical organization captures the
relationships between CI operators and allows
selection  rules  to  provide  better  guidance  when 
confronted  with  new  domains.  The  current 
MCI  system  has  capabilities  for  both  types  of 
constructors,  logical  attribute  and  logical 
instance  destructors,  and  statistical  attribute- 
value  removal. 

The  system  was  bootstrapped  by  providing 
meta-examples  describing  datasets  for  which 
appropriate  representation  space  modifications 
were  already  determined.  This  was  done  to 
confirm  if  the  resulting  meta-rules  agree  with 
experience.  Descriptions  for  seven  domains 
were provided, including two MONK's problems
(Thrun et al., 1991), Congressional voting
records from 1984 (Bloedorn and Michalski,
1991), texture data (Pachowicz et al., 1992),
artificially generated DNF4 functions and
multiplexer 11 (Wnek, 1993), and finally
wind-bracing data from a civil engineering domain
(Arciszewski et al., 1992).

The appropriate RSM operator for each domain
was found experimentally. These
meta-examples, classified by CI method, were given
to AQ14 so that strategy selection rules could be
learned. Table 2 shows the learned
representation space modification operator
selection  rules.  Default  rules  are  used  in  the 
case  of  RSM  operators  that  do  not  yet  have 
meta-examples  in  the  knowledge  base. 

The  degree  of  match  between  an  example  and  a 
rule  is  calculated  using  the  method  of  ATEST 
(Reinke, 1984). The degrees of match for all
meta-rules whose match to the dataset
characterization exceeds a threshold are
displayed to the user.

dci_numeric <=
    [Numeric_attributes_present = Yes] &
    [Attribute_value_noise_level = 0%]

dci_boolean <=
    [Numeric_attributes_present = No] &
    [Nominal_attributes_present = No] &
    [Irrelevant_attributes_present = No]

dci_nominal <=
    [Nominal_attributes_present = Yes] &
    [Attribute_value_noise_level = 0%]

hci_rule_grouping <=
    [Attribute_value_noise_level = 0%] &
    [Irrelevant_attributes_present = Yes]

rule_based_instance_removal <=
    [Overprecision = Yes] &
    [Attribute_value_noise_level = 5%]

stat_based_attribute_value_removal <=
    [Overprecision = Yes] &
    [Attribute_value_noise_level = 5%]

Table 2. Examples of learned meta-rules for
representation space modification

3.3.3  Example  Reformulation 

After the representation space modification has
been selected, the training data are reformulated
in this space. The generation module has a
number of fundamental CI operators with
which it can modify the primary and secondary
training sets. These operators include those used
by a number of previous systems (Bloedorn
and Michalski, 1991), (Pachowicz et al.,
1992), (Wnek and Michalski, 1993). Some of
these fundamental operators have been reported
by others, notably Rendell and Seshu (1990).
The following MCI operators are equivalent to
the terms used by Rendell and Seshu: attribute removal
(projection), attribute-value removal
(puncturing), and hypothesis-driven
constructive induction (superpositioning).

a.  Attribute  Removal 

Attribute removal makes a selection of a set X*
of attributes from the original attribute set X. In
MCI, a logic-based attribute removal is
performed based on the quality of an attribute
(as described by the meta-attribute
"Irrelevant_attributes_present"). The
irrelevancy of an attribute is calculated by
analyzing rules generated by the Decision Rule
Generation module. For each attribute, the
numbers of examples covered by each
discriminant rule that includes the attribute
are summed. Attributes that are irrelevant will
be useful only to explain instances that are
distant from the majority of examples in the
distribution. Thus, these attributes will have
low total-weight sums. Logic-based attribute
removal is performed in MCI by AQ17-HCI.
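
The total-weight computation described above can be
sketched as follows. This is only an illustration: the
encoding of a rule as a pair (attributes used, number of
examples covered), the function names and the cut-off
`min_weight` are assumptions made for the example, not
details of AQ17-HCI.

```python
from collections import defaultdict


def attribute_weight_sums(rules):
    """Sum, per attribute, the examples covered by every rule that uses it.

    `rules` is a list of (attributes_used, examples_covered) pairs; this
    encoding is hypothetical and chosen only for the sketch.
    """
    totals = defaultdict(int)
    for attributes, covered in rules:
        for attribute in set(attributes):
            totals[attribute] += covered
    return dict(totals)


def irrelevant_attributes(rules, all_attributes, min_weight):
    """Return the attributes whose total-weight sum is below `min_weight`."""
    totals = attribute_weight_sums(rules)
    return sorted(a for a in all_attributes if totals.get(a, 0) < min_weight)


# Hypothetical rule set: x4 appears only in a rule covering two examples,
# and x5 appears in no rule at all.
rules = [
    ({"x1", "x2"}, 40),
    ({"x1", "x3"}, 35),
    ({"x4"}, 2),
]
candidates = irrelevant_attributes(
    rules, ["x1", "x2", "x3", "x4", "x5"], min_weight=5
)
print(candidates)
```

Attributes that appear only in rules covering few examples
receive low sums and become candidates for removal.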

b.  Attribute-value  Modification 

Attribute-value modification can be either the
addition (concretion) of values to an existing
attribute domain, or the deletion (abstraction) of
attribute values. Currently MCI implements
only abstraction, based on the chi-square
correlation between an attribute-value interval
and the class. Using chi-square to quantize data
was first proposed by Kerber (Kerber, 1992).
Attribute-value modification (AVM) selects a
set V' ⊂ V (where V is the domain of A) of
allowable values for attribute A. AVM can be
used to reduce multi-valued nominal domains,
or real-valued continuous data, into useful
discrete values. Discretization is especially
important for empirical induction methods that
allow only a small number of discrete attribute
values, such as ID3 (Quinlan, 1983) and AQ
(Michalski, 1983a). In MCI, statistic-based
attribute-value removal is performed by a
chi-square based method.
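
A chi-square based abstraction step might look like the
sketch below. It follows the general idea of merging the
two adjacent intervals whose class distributions differ
least; the per-interval class-count lists, the function
names and the single merge step are assumptions made for
illustration and are not taken from the MCI implementation.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    Each argument lists the number of examples of every class that fall
    into the interval, e.g. [positives, negatives].
    """
    rows = [counts_a, counts_b]
    n = sum(sum(row) for row in rows)
    if n == 0:
        return 0.0
    column_totals = [sum(column) for column in zip(*rows)]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / n
            if expected > 0:  # skip empty cells to avoid dividing by zero
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def merge_step(intervals):
    """Merge the adjacent pair of intervals with the lowest chi-square."""
    scores = [chi_square(a, b) for a, b in zip(intervals, intervals[1:])]
    i = scores.index(min(scores))
    merged = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]


# Three hypothetical intervals of one attribute: the first two contain only
# class-1 examples, so they are merged into a single abstract value.
intervals = [[4, 0], [5, 0], [0, 6]]
print(merge_step(intervals))
```

A low chi-square value indicates that the two intervals
carry the same information about the class, so treating
them as one value loses little.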

c. Hypothesis-driven CI

Hypothesis-driven CI (HCI) is a method for
constructing new attributes based on an
analysis of inductive hypotheses. Useful
concepts in the rules can be extracted and used
to define new attributes. These new attributes
are useful because they explicitly express hidden
relationships in the data. This method of
hypothesis analysis as a means of constructing
new attributes is detailed in a number of places
including (Wnek, 1993; Wnek and Michalski,
to appear 1993). Wnek and Michalski define a
hierarchy of hypothesis patterns from the
simplest (value-groupings) to the most complex
(rule-groupings), which is implemented in
AQ17-HCI. AQ17-HCI is used in MCI to
perform rule-based constructions of attributes
based on value-groupings, condition-groupings
and rule-groupings, and attribute removal (see
section a).

d. Data-driven CI

Data-driven CI (DCI) methods build new attributes
based on an analysis of the training data. One
such method is AQ17-DCI (Bloedorn and
Michalski, 1991). In AQ17-DCI new attributes
are constructed by a generate-and-test
method using generic domain-independent
arithmetic and boolean operators. In addition to
simple binary application of arithmetic
operators including +, -, *, and integer
division, there are multi-argument functions
such as maximum value, minimum value,
average value, most-common value,
least-common value, and #VarEQ(x) (a cardinality
function which counts the number of attributes
in an instance that take the value x). Another
multi-argument operator is the boolean
counting operator. This operator takes a vector
of m boolean-valued attributes (m >= 2) and
counts the number of true values for a
particular instance. This approach is able to
capture m-of-n type concepts. Data-based
logical construction in MCI is performed by
AQ17-DCI using the multi-argument functions
of #VarEQ(x), most-common, least-common,
boolean counting, and binary boolean
operators. Data-based arithmetic construction is
performed by AQ17-DCI through maximum,
minimum, average, and +, -, * and integer
division.
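
The two counting operators can be illustrated with a short
sketch. The tuple encoding of an instance, the coding of
each attribute's first value as 1, and the function names
are assumptions made for the example; they are not the
AQ17-DCI interface.

```python
def var_eq(instance, value):
    """#VarEQ(x): count the attributes of `instance` that take `value`."""
    return sum(1 for v in instance if v == value)


def boolean_count(instance, attribute_indices):
    """Count the true values among the boolean attributes at the indices."""
    if len(attribute_indices) < 2:
        raise ValueError("the counting operator needs at least two attributes")
    return sum(1 for i in attribute_indices if instance[i])


# Hypothetical instance with six attributes; the first value of every
# attribute is coded as 1 (an assumption of this sketch).
instance = (1, 2, 1, 3, 2, 2)
n_first = var_eq(instance, 1)

# With the constructed attribute, an "exactly two attributes take their
# first value" concept becomes a single condition on that attribute.
is_positive = n_first == 2

flags = (True, False, True, True)
n_true = boolean_count(flags, [0, 1, 2, 3])
print(n_first, is_positive, n_true)
```

A single constructed attribute of this kind lets a rule
learner state an m-of-n concept with one condition instead
of enumerating every combination of the original attributes.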

e.  Instance  Removal 

Instance removal (IR) methods detect and filter
noisy or misclassified training examples. The
method used in MCI is a logic-based approach
implemented in AQ-NT (Pachowicz, et al.,
1992). The IR operator removes instances from
the training data if they are covered by 'light'
disjuncts. Light disjuncts are those disjuncts in
the rule which cover only a small fraction of the
total number of instances in the class. Thus if
the ratio of covered instances to total instances
in a class is below some threshold, the covered
instances are removed from the training data.
The relationship between the
weight of learned rules and the plausible
prototypicality of examples was first described
in the AQ15-TRUNC method (Michalski,
1983b). Other work, based on calculating the
statistical significance of individual instances, is
reported in (Holte, Acker and Porter, 1989).
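
The light-disjunct test can be sketched as below. The
representation of a disjunct as the set of instance
identifiers it covers is hypothetical, and the choice to
keep an instance that is also covered by a heavier disjunct
is an assumption of this sketch rather than a documented
property of AQ-NT.

```python
def instances_to_remove(disjuncts, class_size, threshold):
    """Return instances covered only by 'light' disjuncts of one class rule.

    `disjuncts` is a list of sets of instance ids (a hypothetical encoding).
    A disjunct is light when it covers less than `threshold` of the class.
    """
    light, heavy = set(), set()
    for covered in disjuncts:
        if len(covered) / class_size < threshold:
            light |= covered
        else:
            heavy |= covered
    # Assumption: an instance also explained by a heavy disjunct is kept.
    return light - heavy


# Hypothetical class of 10 instances described by three disjuncts.
disjuncts = [{1, 2, 3, 4, 5, 6, 7, 8}, {8, 9}, {10}]
removed = instances_to_remove(disjuncts, class_size=10, threshold=0.25)
print(sorted(removed))
```

Here the second and third disjuncts cover 20% and 10% of
the class, so the instances they alone explain are treated
as likely noise.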

3.4  Rule  Evaluation 

Once a CI operator has been selected and
applied to the data, or as a part of the initial
detection step, the resulting classification rules
must be evaluated (Figure 1). Depending on
rule quality, either control is returned to the
representation space modification module or
the process stops. Rule evaluation is
based on a number of criteria. As described in
(Bergadano, et al., 1988), the quality of a
concept description may be judged by three
criteria: accuracy, simplicity and cost. In their
approach, as in MCI, the user selects the
relative importance of each of these criteria.

The predictive accuracy of a rule set is a
measure of the ability of the rule set to correctly
classify examples that were previously unseen.
In MCI predictive accuracy is tested using a
secondary training set. The secondary set is
selected from the data the learner has not yet
seen. Neither the primary nor the secondary
data are used for testing. Rules that are learned
from the primary training set and also perform
well on the secondary set are less likely to be
overfitted to the original data. Predictive
accuracy is measured as the percentage of
secondary training examples correctly
classified.

Complexity  of  a  ruleset  is  evaluated  by 
counting  the  number  of  rules  in  the  ruleset  and 
the  total  number  of  conditions. 

Cost  is  a  measure  of  the  price  of  evaluating  the 
values  of  variables  used  in  the  description. 
Each  variable  has  an  associated  cost  provided 
by  the  user.  A  parameter  within  the  rule- 
learning  program,  AQ,  can  be  used  to  control 
the  use  of  attributes  in  a  description  based  on 
cost.  For  this  reason  cost  is  not  included  in  the 
quality  calculation  presented  here. 

The final quality of the ruleset is evaluated
lexicographically. Rulesets are evaluated first
according to the accuracy criterion. If the
accuracy is within a user-defined threshold of
the goal accuracy, the ruleset is then further
evaluated according to the complexity criterion.
If the ruleset does not meet the minimum
standard for accuracy, it is rejected and no
further processing is done. The lexicographic
evaluation permits the user to set a constraint
on the minimum allowable accuracy.
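
The lexicographic evaluation can be sketched as a two-stage
comparison. The goal accuracy of 90%, the tolerance of 10
percentage points, and the tie-breaking order (number of
rules first, then number of conditions) are assumptions made
for the example; the accuracy and complexity figures are
those reported in Table 3.

```python
def evaluate(ruleset, goal_accuracy, tolerance):
    """Return a complexity key (lower is better), or None if rejected."""
    if ruleset["accuracy"] < goal_accuracy - tolerance:
        return None  # fails the accuracy criterion: no further processing
    # Assumption: compare number of rules first, then total conditions.
    return (ruleset["rules"], ruleset["conditions"])


def best_ruleset(rulesets, goal_accuracy, tolerance):
    """Pick the least complex ruleset among those that pass on accuracy."""
    accepted = []
    for ruleset in rulesets:
        key = evaluate(ruleset, goal_accuracy, tolerance)
        if key is not None:
            accepted.append((key, ruleset["name"]))
    return min(accepted)[1] if accepted else None


rulesets = [
    {"name": "AQ14", "accuracy": 47.2, "rules": 37, "conditions": 327},
    {"name": "AQ17-HCI", "accuracy": 42.1, "rules": 13, "conditions": 55},
    {"name": "AQ17-DCI", "accuracy": 81.5, "rules": 17, "conditions": 122},
    {"name": "MCI", "accuracy": 90.2, "rules": 8, "conditions": 23},
]
winner = best_ruleset(rulesets, goal_accuracy=90.0, tolerance=10.0)
print(winner)
```

Note that AQ17-HCI, although simple, is rejected at the first
stage, which is the constraint on minimum accuracy described
above.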


Given a set of classified meta-examples, new
meta-rules can be learned or improved.
Meta-rules are generated by AQ14. The new
meta-rules generalize the previous meta-examples.
Meta-rules will now be capable of classifying
unseen databases according to their suitability for
representation space modification. Examples of
learned meta-rules are presented in Table 2.


3.5  Storing  Experience  of  Operator 
Selection:  Meta-examples 

Each time a strategy is selected and evaluated
against the secondary training examples,
the results of the modification must be stored.
If the application resulted in an improvement in
rule quality, the meta-example characterizing
the dataset is inserted into the knowledge base
under the class representing the RSM operator
which made the useful modification of the
representation space. If the quality remained
constant or declined, the user determines if the
meta-example should be stored. The problem
of learning meta-rules which not only link a
dataset to an operator, but also make the
selection in the context of previous selections, is
discussed in section 6.


4.  Experiments 

The MCI method was tested on an artificial
problem, an extension of the difficult second
Monk's problem in which both
misclassification noise and irrelevant attributes
are added. This problem, "Noisy and Irrelevant
Monk2" (NIM2), extends the difficulty of the
second Monk's problem by including 5%
random misclassification noise (9 training
examples) and adding 7 irrelevant, randomly
generated attributes to the original set of 6.
The goal concept of the NIM2 problem, like
that of the original Monk 2, is: "exactly two of
the 6 attributes take their first value". In the
Monk 2 problem, dci-nominal attribute
construction modified the training data by
adding a new attribute which represented the
number of attributes which take their first
value. This modification allowed AQ


Problem: Noisy Monk2 (5% noise); #Classes=2, #Attributes=13, AVD Size=3

Method                                         Accuracy        Complexity
                                               (Exact match)   #Rules  #Conds
AQ14 (No data modifications)                   47.2%           37      327
AQ14 (with stat attrib. value removal)         46.8%           43      236
AQ17-HCI (Rule-based attribute
  construction and removal)                    42.1%           13      55
AQ-NT (Rule-based instance removal)            43.1%           19      125
AQ17-DCI (Data-driven attribute construction)  81.5%           17      122
MCI                                            90.2%           8       23

AVD - "attribute value domain"

Table 3. A performance comparison of the MCI method with several single strategy methods.
to find the goal concept, resulting in rules which
perfectly stated the goal concept. AQ17-HCI
also solved this problem by constructing new
attributes based on "xor-rule-patterns" (Wnek,
1993).

The  NIM2  problem,  however,  is  more 
difficult.  For  this  problem,  dci-nominal 
construction  builds  the  same  attribute,  but  the 
goal  concept  has  been  disrupted  by 
misclassified  examples.  This  results  in  fairly 
accurate,  but  complex  rules  (81%  predictive 
accuracy,  17  rules). 

The MCI method was also applied to the
problem. The detection step was performed
with AQ14, generating 37 rules with a
predictive accuracy of 47%. When presented
with NIM2, MCI first invoked rule-based
instance removal. Using AQ-NT, 5% of the
training examples were removed, and new
rules were learned. There were 19 new rules
with an accuracy of 43%. MCI next invoked
dci-nominal, which constructed a new
attribute representing the number of attributes
which take their first value. With this new
attribute, AQ was able to generate 12 rules with
an accuracy of 86%. When the operator
selection module was invoked again, the
meta-rule for dci-nominal construction continued to
have the greatest match to the meta-example
describing the dataset. When dci-nominal was
invoked again, no new attributes were
constructed (no representation space
modification was made), so the next best
method, HCI, was selected by the user.

In this third representation modification, HCI
added two new attributes and deleted seven
features: x2, x3, x6, x7, x8, x10 and x13.
When AQ was invoked on the transformed
database, 8 rules with only 23 conditions were
generated with an accuracy of 90%. MCI
selection ceased when dci-nominal was selected
again, and no new attributes were constructed.
The combination of rule-based instance
removal (AQ-NT), data-driven CI
(dci-nominal) and hypothesis-driven CI and
attribute removal produced a ruleset which has
significantly fewer rules (8 vs. 17),
significantly fewer total conditions (23 vs. 122)
and better performance (90% vs. 81%) than
that of the next best single-strategy
constructive induction method, AQ17-DCI.
In Table 3, MCI is compared to AQ14, which
has no methods for data modification; to AQ14
after processing the data with a chi-square
based attribute-value removal method; and to
AQ17-HCI, AQ-NT and AQ17-DCI.

The problem of determining the context of
operator selection decisions is a matter of future
work. It is interesting to note that when
different meta-rules are used (characteristic vs.
discriminant), the MCI method selects only
dci-nominal construction and then HCI. The
resulting ruleset is still superior in predictive
accuracy to that of any single-strategy method,
but is more complex (88.2% accuracy, 17 rules,
57 conditions).

5.  Summary 

This paper presented a methodology of
multistrategy constructive induction that
integrates two inferential learning strategies
(empirical induction and deduction) and two
computational methods (data-driven and
hypothesis-driven). Empirical induction was
performed in the Rule Generation module, and
in the search for appropriate Representation
Space Modifications (the double search of
constructive induction). Deduction was used in
the application of learned meta-rules to the
characterization of incoming datasets in order to
select an appropriate representation space
modification. MCI includes "constructor" and
"destructor" modifiers. Modifier selection is
based on meta-rules learned from the results of
past applications of modifiers.

The MCI approach was tested on a problem,
NIM2, characterized by misclassification
noise and irrelevant attributes. The MCI
method produced rules which surpassed not
only those of traditional selective induction
learning (no representational modifications),
but also those of single-strategy methods in
terms of quality. The quality of the resulting
ruleset was superior both in terms of predictive
accuracy on the testing examples and in terms
of complexity.

6.  Future  Work 

One important area of improvement of the
current method is the determination of a good
criterion for when to stop applying
representation space modification (RSM)
operators. In general, rule quality changes when
representation space modifications are made.
Currently, RSM operator selection, application
and evaluation are repeated until the user is
satisfied with the current ruleset quality. But if
the user is not satisfied, and the change in
rule quality has been negative, the question
arises as to whether the system should
recommend to the user some new ways of
continuing the search process.

Such a decision should be based on a new type
of meta-knowledge that keeps track of which
RSM operators have been tried so far, and
which have not. The meta-attribute set must
capture this knowledge, and the matching
algorithm must support a more sophisticated
concept of context and the sequence of RSM
operators before this process can be completely
automated.

Constructive induction is a
knowledge-intensive learning process. Further research
should provide even more advanced capabilities
for introducing and employing domain
knowledge to guide constructive induction
(Ragavan and Rendell, 1991). For example,
there should be a facility for a user to indicate
different preferences for various types of
constructive induction operators. Also, it
should be easy for the user to give advice as to
the use of some new type of operator.

This raises a general issue of how to include
sophisticated knowledge representation
capabilities within a constructive induction
system. Consequently, there is a need for
a general method for determining what type of
knowledge should be represented to create a
more adequate knowledge representation, and
how it should be used.

AQ17-MCI uses a rule-based knowledge
representation system. An interesting issue is to
investigate how various ideas and operators
implemented in AQ17-MCI could be employed
in learning systems using different knowledge
representations, e.g., decision trees, semantic
networks, neural nets, etc. Employing any type
of modification operator within another
representation language will require dealing with
the problems already addressed here, such as
detection and reduction of the overprecision of
data, noise in the training data, or low quality
data (e.g., many irrelevant attributes). It is
believed that the same cues used to select
transformations relevant to a rule-based
representation will be useful for other
representation languages.

The above raises a general issue of developing
a constructive induction learning system that
employs a multi-type representation language.
This would allow the system to represent
different types of knowledge in the form that is
most suitable to them.

Abstract

A distinct advantage of symbolic learning
algorithms over artificial neural networks is
that typically the concept representations
they form are more easily understood by
humans. A multistrategy approach to
understanding the representations formed by
neural networks is to extract symbolic rules from
trained networks. In this paper we describe
and investigate an approach for extracting
rules from networks that uses the NofM
extraction algorithm and the network training
method of soft weight-sharing. The NofM
algorithm had previously been successfully
applied only to knowledge-based neural
networks. Our experiments demonstrate that
our extracted rules generalize better than
rules learned using the C4.5 algorithm. In
addition to being accurate, our extracted
rules are also reasonably comprehensible.

Keywords:  artificial  neural  networks, 
rule  extraction,  empirical  comparison 

1  Introduction 

Artificial neural networks (ANNs) have been
successfully applied to real-world problems
as varied as steering a motor vehicle
(Pomerleau, 1991) and learning to pronounce
English text (Sejnowski & Rosenberg, 1987). In
addition to these practical successes, several
empirical studies have concluded that neural
networks provide performance comparable
to, and in some cases better than, common
symbolic learning algorithms (Fisher
& McKusick, 1989; Mooney et al., 1989;
Weiss & Kapouleas, 1989). A distinct
advantage of symbolic learning algorithms,
however, is that the concept representations
they form are usually more easily understood
by humans than the representations formed
by neural networks. In this paper we
describe and investigate a multistrategy
approach to inductive learning that involves
extracting symbolic rules from trained neural
networks. Our approach uses the NofM
algorithm (Towell & Shavlik, 1991) to extract
rules from networks that have been trained
using Nowlan and Hinton's method (1992)
of soft weight-sharing. We present
experiments that demonstrate that, for two
difficult learning tasks, our method learns rules
that are more accurate than rules induced by
Quinlan's C4.5 algorithm (1993).
Furthermore, the rules that are extracted from our
trained networks are comparable to rules
induced by C4.5 in terms of complexity and
understandability.

Towell and Shavlik (1991) demonstrated
that concise and accurate symbolic rules
can be extracted from knowledge-based
neural networks. In a knowledge-based
network, the topology and initial weights of
the network are specified by a domain
theory consisting of symbolic inference rules.
Since these networks initially encode
symbolic rules, training is more a process of
rule refinement than of tabula rasa learning.
This paper describes work that involves
using Towell and Shavlik's NofM algorithm
to extract rules from ANNs which have not
been initialized by a domain theory.
Because the NofM algorithm assumes that
the weights in a trained network are
clustered, we modify the training process to
encourage such a network state after training.
Previously, Towell (1991) reported that the
NofM algorithm failed to extract accurate
rules from conventional networks.

We use two problem domains to
investigate the effectiveness of our approach. The
first domain involves recognizing promoters
in DNA (Towell et al., 1990). Promoters are
short nucleotide sequences that occur
before genes and serve as binding sites for the
protein RNA polymerase during gene
transcription. Identifying promoters is an
important step in locating genes in DNA
sequences. The second problem domain that
we investigate is a simplified version of the
NETtalk task of mapping English text to
its pronunciation (Sejnowski & Rosenberg,
1987). Our scaled-down version of this
domain involves learning only the stresses (but
not the phonemes) from a corpus of the 1000
most common English words.

The organization of this paper is as
follows: the next section discusses the problem
of extracting rules from neural networks and
describes the NofM algorithm that is
employed in our approach. Section 3 describes
soft weight-sharing and how we use it for the
task of rule extraction. Section 4 describes
two problem domains that we use to
investigate the effectiveness of our approach, and
section 5 presents experimental results for
these domains. The final section provides
conclusions and a discussion of future work.

2 Extracting Rules From Neural Networks

An important criterion by which a
machine learning algorithm should be judged is
the comprehensibility of the representations
formed by the algorithm. That is, does the
algorithm encode the information it learns in
such a way that it may be inspected and
understood by humans? There are at least five
reasons why this is an important criterion.

• Validation. If the designers and
end-users of a learning system are to be
confident in the performance of the system,
then they must understand how it
arrives at its decisions.

• Discovery. Learning algorithms may
discover salient features in the input
data whose importance was not
previously recognized. If the representations
formed by the algorithm are
comprehensible, then these discoveries can be made
accessible to human review.

• Explanation. If the representations are
understandable, then an explanation of
the classification made on a particular
case can be garnered.

• Improving generalization. The feature
representation used for an inductive
learning task can have a significant
impact on generalization performance.
Understanding learned concept
representations may facilitate the design of a
better feature representation for a given
problem.

• Refinement. Some researchers use
inductive learning systems to refine an
approximately-correct domain theory
(Ourston & Mooney, 1990; Towell et al.,
1990). When a learning system is used
in this way, it is important to
understand the changes to the knowledge base
that have been imparted during the
training process.

2.1  Rule  Extraction  Methods 

A significant limitation of artificial neural
networks is that the concepts they learn
are virtually impenetrable to human
understanding because concepts are represented
by a large number of real-valued
parameters: the weights and biases of the
network. One approach to understanding the
representations formed by a neural network
is to use scientific visualization techniques
(Wejchert & Tesauro, 1990). A second
approach, which is applicable to small feature
spaces, involves a combination of
visualization and rule extraction (Wnek & Michalski,
1991). The approach on which we focus in
this paper is the extraction of symbolic rules
from networks of arbitrary size (Fu, 1991;
McMillan et al., 1991; Saito & Nakano,
1988).

The underlying premise of these
rule-extraction methods is that each hidden and
output unit in the network can be thought of
as implementing a symbolic rule. The
concept associated with each unit is the
consequent of the rule, and certain subsets of the
units that feed into this unit represent the
antecedents of the rule. As shown in
Figure 1, the process of rule extraction involves
finding the sufficient conditions for each
consequent. In order to find such sets of
sufficient conditions, rule-extraction methods
assume that, after training, hidden and
output units tend to be either maximally active
(i.e., have activation near one), or inactive
(i.e., have activation near zero). Given this
assumption, a rule-extraction algorithm can


Figure 1: Extracting rules from a unit in a
neural network. The extracted rules show which
combinations of antecedent units must be activated in
order for the consequent unit's bias to be exceeded.

search for minimal sets of antecedent units
that, when maximally active, cause the
consequent unit to become maximally active.
The process of searching for rules is
problematic because of the combinatorics involved.
The complexity of this search, in the worst
case, is O(2^n) where n is the number of
connections impinging on the consequent unit.
Moreover, these algorithms tend to extract
a large number of rules, even for networks of
moderate complexity.
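
The source of the exponential cost can be seen in a
brute-force sketch of the subset search. The sketch assumes
that all incoming weights are positive, that antecedents
outside the chosen set are inactive, and that a unit fires
when the summed weight of its active antecedents exceeds its
bias; the weights and bias are hypothetical, not those of
Figure 1.

```python
from itertools import combinations


def sufficient_sets(weights, bias):
    """Enumerate minimal antecedent sets whose summed weight exceeds `bias`.

    `weights` maps antecedent names to positive weights. Every subset may
    be examined, so the search is O(2^n) in the number of antecedents.
    """
    names = sorted(weights)
    found = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if any(set(previous) <= set(subset) for previous in found):
                continue  # a smaller sufficient set is contained: not minimal
            if sum(weights[name] for name in subset) > bias:
                found.append(subset)
    return found


# Hypothetical unit with four incoming connections.
weights = {"A": 6.0, "B": 4.0, "C": 4.0, "D": 1.0}
for antecedents in sufficient_sets(weights, bias=7.0):
    print("if", " and ".join(antecedents), "then consequent")
```

Even this small unit yields three rules; with tens of
incoming connections both the search and the resulting rule
set grow quickly.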

2.2  The  NofM  Algorithm 

Towell and Shavlik previously described an
algorithm, called NofM, that avoids the
combinatoric and rule-set size problems of
other rule-extraction algorithms by
clustering weights into equivalence classes. They
have demonstrated that their NofM
algorithm is able to extract accurate and
concise rules from trained knowledge-based
neural networks; that is, networks for which the
topology and initial weights have been
specified by an approximately-correct domain
theory. The algorithm is called NofM
because it explicitly searches for rules of the
form:

If (N of the M antecedents are true) then ...
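
A rule of this form can be evaluated with a one-line test.
The sketch reads "N of the M" as "at least N of the M",
which is an interpretation assumed here; the function name
is hypothetical.

```python
def n_of_m(n, antecedents):
    """True when at least `n` of the given antecedent truth values hold."""
    return sum(1 for value in antecedents if value) >= n


# "If 2 of the 3 antecedents are true then ..." on two hypothetical cases.
print(n_of_m(2, [True, False, True]))
print(n_of_m(2, [False, False, True]))
```

A single rule of this kind stands in for every conjunctive
rule over N-element subsets of the M antecedents, which is
why the format keeps extracted rule sets small.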

The  NofM  algorithm  comprises  six  steps: 

1. Clustering. The weights impinging
on each hidden and output unit of the
trained network are grouped into
clusters. Initially, each weight is treated as
a cluster. The two nearest clusters are
successively merged until no pair of
clusters is closer than a preselected distance.
Additionally, weights with small
magnitudes are pruned from the network at
this step.

2. Averaging. The magnitude of each
weight is set to the average value of the
weights in its cluster.

3. Eliminating. Weight clusters that are
not needed in order to correctly
activate a unit are eliminated. Two
elimination procedures are applied: one
algorithmic and one heuristic. The
algorithmic elimination procedure
identifies clusters that cannot have an
effect on whether or not a unit's bias
is exceeded. The heuristic elimination
step eliminates clusters that do not have
such an effect for any of the training
examples.

4. Optimizing. The unit biases are
retrained to adapt the network to the
changes that have been imparted by the
previous steps.

5. Extracting. Each hidden and
output unit is translated into a rule with
weighted antecedents such that the
consequent is true if the sum of the
weighted antecedents exceeds the bias.

6. Simplifying. Weights and thresholds
are eliminated and rules are expressed
in the NofM format.
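
The first two steps can be sketched for the weights of a
single unit. The sketch measures the distance between two
clusters as the difference of their means and omits the
pruning of small weights; both choices, together with the
weight values and the distance threshold, are assumptions
made for illustration rather than details of Towell and
Shavlik's implementation.

```python
def mean(cluster):
    return sum(cluster) / len(cluster)


def cluster_weights(weights, min_distance):
    """Merge the two nearest clusters until none are closer than the limit."""
    # Start with one cluster per weight. In one dimension the nearest pair
    # of clusters is always adjacent once the clusters are sorted.
    clusters = [[w] for w in sorted(weights)]
    while len(clusters) > 1:
        gaps = [mean(b) - mean(a) for a, b in zip(clusters, clusters[1:])]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] >= min_distance:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def average_clusters(clusters):
    """Replace every weight by the average value of its cluster."""
    return [[mean(c)] * len(c) for c in clusters]


# Hypothetical weights impinging on one unit.
weights = [6.0, 5.5, 6.5, 1.0, 0.5, 1.5]
clusters = cluster_weights(weights, min_distance=1.0)
print(clusters)
print(average_clusters(clusters))
```

After averaging, all antecedents in a cluster are
interchangeable, so only the number of active antecedents
per cluster matters when rules are extracted.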


Figure 2: Extracting rules using the NofM
method. The dotted ovals illustrate how the weights
have been grouped into clusters. Each weight has been
set to the average value of its cluster.

Figure 2 illustrates the application of the
NofM method to the unit shown in Figure 1. The
weights have been grouped into two clusters,
and each weight has been set to the average
value of its cluster. One of the extracted
rules is expressed in the NofM format; the
other two rules are trivial NofM cases (1 of
1). The eliminating and optimizing steps are
not depicted in this example.

3  Extending  NofM  With  Soft 
Weight-Sharing 

An  underlying  assumption  of  the  NofM 
method  is  that  the  distribution  of  weights 
in  the  network  will  be  conducive  to  forming 
a  small  number  of  clusters  for  each  hidden 
and  output  unit.  For  knowledge-based  neu¬ 
ral  networks,  this  is  a  reasonable  assumption 
since  the  weights  are  clustered  at  least  before 
training. For example, using the KBANN
algorithm for mapping symbolic rules into
a knowledge-based network (Towell et al.,
1990), the weights that are specified by the
domain theory have values of approximately
4 and -4, whereas the rest of the weights
have values near 0. Experimental evidence
indicates  that  the  weights  tend  to  be  fairly 
well  clustered  after  training  as  well  (Towell, 
1991). 

The  applicability  of  the  NofM  method 
might  seem  to  be  limited  to  knowledge-based 
networks  since  in  conventional  neural  net¬ 
works  there  is  usually  not  a  bias  that  leads 
weight values to be clustered after training.
In fact, Towell (1991) reported that NofM
did not extract small sets of accurate rules
from conventional networks. However, the
approach that we explore in this paper does
not rely on the network weights being
initially clustered, but instead encourages
clustering during network training. We use a
method  developed  by  Nowlan  and  Hinton 
(1992),  termed  soft  weight-sharing,  that  en¬ 
courages  weights  to  form  clusters  during  the 
training  process.  Although  their  method  was 
motivated  by  the  desire  for  better  general¬ 
ization,  we  explore  it  here  as  a  means  for 
facilitating  rule  extraction. 

Soft  weight-sharing  uses  a  cost  function 
that  penalizes  network  complexity  so  that 
during  training,  the  network  tries  to  find 
an  optimal  tradeoff  between  data-misfit  (i.e., 
the  error  rate  on  the  training  examples)  and 
complexity.  The  complexity  term  in  soft 
weight-sharing  models  the  distribution  of 
weights  in  the  network  as  a  mixture  of  multi¬ 
ple  Gaussians.  A  set  of  weights  is  considered 
to  be  simple  if  the  weights  have  high  prob¬ 
ability  densities  under  the  mixture  model. 
Specifically, the cost function in soft
weight-sharing is the following:


\[
C = \lambda E - \sum_i \log \Big[ \sum_j \pi_j \, p_j(w_i) \Big]
\]


where $E$ is the data-misfit term, $\lambda$ is a
parameter used to balance the tradeoff between
data misfit and complexity, $w_i$ is a weight in
the network, $p_j(w_i)$ is the density value of $w_i$
under the $j$th Gaussian, and $\pi_j$ is the mixing
proportion of the $j$th Gaussian. A mixing
proportion is a weight that determines
the influence of a particular Gaussian. The
mixing proportions are constrained to sum
to 1.

The  partial  derivative  of  the  cost  function 
with  respect  to  each  weight  is  the  sum  of  the 
usual  error  derivative  plus  a  term  due  to  the 
complexity  cost  of  the  weight: 


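\[
\frac{\partial C}{\partial w_i} = \lambda \frac{\partial E}{\partial w_i} - \sum_j r_j(w_i) \, \frac{\mu_j - w_i}{\sigma_j^2}
\]

Here $\mu_j$ and $\sigma_j^2$ are the mean and variance,
respectively, of the $j$th Gaussian, and $r_j(w_i)$
is the conditional probability that $w_i$ is being
modelled by the $j$th Gaussian:

\[
r_j(w_i) = \frac{\pi_j \, p_j(w_i)}{\sum_k \pi_k \, p_k(w_i)}
\]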


Thus,  the  effect  of  each  Gaussian  is  to  pull 
each  weight  toward  the  mean  of  the  Gaus¬ 
sian  with  a  force  proportional  to  the  density 
of  the  Gaussian  at  the  value  of  the  weight. 
When  weights  are  pulled  tightly  around  the 
means of the Gaussians, the network is similar
to one that has fewer free parameters
than connections (ordinary weight sharing).
The parameters of each Gaussian - the mean
$\mu_j$, standard deviation $\sigma_j$, and mixing
proportion $\pi_j$ - are learned simultaneously with
the weights during training.
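
The complexity term and its derivative can be checked numerically with a short pure-Python sketch. The function names and the two-Gaussian example values are illustrative assumptions; the sketch covers only the complexity part of the cost (the data-misfit term and the learning of the Gaussian parameters are left out).

```python
import math

def gaussian_pdf(w, mean, std):
    """Density of a one-dimensional Gaussian at w."""
    z = (w - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(w, pis, means, stds):
    """r_j(w): probability that weight w is modelled by Gaussian j."""
    joint = [pi * gaussian_pdf(w, m, s) for pi, m, s in zip(pis, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def complexity_cost(weights, pis, means, stds):
    """-sum_i log sum_j pi_j p_j(w_i), the complexity part of C."""
    return -sum(
        math.log(sum(pi * gaussian_pdf(w, m, s)
                     for pi, m, s in zip(pis, means, stds)))
        for w in weights
    )

def complexity_gradient(w, pis, means, stds):
    """Derivative of the complexity cost with respect to one weight:
    -sum_j r_j(w) (mu_j - w) / sigma_j^2."""
    r = responsibilities(w, pis, means, stds)
    return -sum(rj * (m - w) / (s * s) for rj, m, s in zip(r, means, stds))
```

With Gaussians centred at 0 and 4 (standard deviation 0.5 each), a weight of 3.0 is claimed almost entirely by the second Gaussian, and a gradient-descent step on the complexity cost moves the weight toward 4.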

Our  approach  to  rule  extraction  involves 
training  networks  using  a  variant  of  soft 
weight-sharing  and  then  applying  the  NofM 
algorithm  to  the  trained  networks.  Al¬ 
though  the  NofM  method  was  designed 
for  knowledge-based  neural  networks,  we 
hypothesized  that  it  could  be  successfully 
applied  to  conventional  networks,  provided 
that  the  weights  of  the  networks  were 
grouped  into  clusters  during  training. 

Whereas  the  NofM  algorithm  works  best 
when  the  weights  impinging  on  each  unit 
form  clusters,  soft  weight-sharing  tends  to 
globally cluster network weights. Our
implementation of soft weight-sharing, however,
assigns a local set of Gaussians to each unit.
The complexity cost of a given weight is
calculated with respect to only the Gaussians
associated with the unit to which the weight
connects.

4  Data  Sets 

Our  experiments  address  the  hypothesis  that 
soft  weight-sharing  is  able  to  cluster  the 
network  weights  during  training  such  that 
NofM is able to extract a small set of accurate
rules. In order to evaluate the effectiveness
of our approach, we use two problem
domains to compare the accuracy and
succinctness  of  our  extracted  rules  against 
rules  induced  by  the  C4.5  algorithm  (Quin¬ 
lan,  1993).  Both  problem  domains  involve 
predicting  a  class  given  a  fixed-length  “win¬ 
dow”  onto  a  string  of  interest.  In  the  case 
of  the  promoter  domain,  the  string  is  a  DNA 
sequence, and in the Nettalk domain, the
string  is  an  English  word  (or  part  of  one). 

The promoter data set comprises 468
examples*, half of which are positive examples
(i.e., promoters). Each example has 57
features which represent the DNA sequence.
A single strand of DNA is a linear chain
composed from the four nucleotides represented
by the letters A, C, G, T. Thus all of the
features for this problem are nominal features
that can take on the values A, C, G, T, or
unknown. Each example is a member of one of
two classes, promoter or non-promoter. Recall
that a promoter occurs before a gene on
a DNA strand. The positive examples for
this data set are aligned such that the gene
following each promoter begins in the seventh
position from the right end of the window.
Thus the leftmost 50 window positions
are labelled -50 to -1, and the rightmost
seven are labelled 1 to 7.

*Note that this data set is larger than the one
that was used in (Towell et al., 1990) and is
available by anonymous ftp from the UC-Irvine Repository
of Machine Learning Databases and Domain Theories
(ftp.ics.uci.edu). The promoter set in the Irvine database
contains only 106 examples.

For  the  neural  networks,  a  local  repre¬ 
sentation  is  used  for  the  promoter  features. 
Thus,  for  each  feature  there  are  four  input 
units  -  one  corresponding  to  each  of  the  nu¬ 
cleotides.  When  the  value  of  a  feature  is 
known,  the  input  unit  corresponding  to  the 
value  is  given  an  activation  of  1,  and  the 
other  three  units  for  the  feature  are  given 
activations  of  0.  When  a  feature  value  is 
unknown,  all  four  input  units  are  given  acti¬ 
vations  of  0.25. 
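
A minimal Python sketch of this input encoding follows. The function names and the use of "?" for an unknown value are assumptions for illustration; any character other than A, C, G, or T is treated as unknown.

```python
NUCLEOTIDES = ("A", "C", "G", "T")

def encode_feature(value):
    """Four input activations for one window position: a 1 for the
    matching nucleotide, or 0.25 on all four units if the value is unknown."""
    if value in NUCLEOTIDES:
        return [1.0 if value == n else 0.0 for n in NUCLEOTIDES]
    return [0.25, 0.25, 0.25, 0.25]

def encode_window(sequence):
    """Concatenate the encodings of every position in the window."""
    activations = []
    for value in sequence:
        activations.extend(encode_feature(value))
    return activations
```

A 57-position window therefore yields 228 input activations.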

Our  simplified  Nettalk  data  set  consists 
of  5438  examples  taken  from  the  1000  most 
common  English  words.  Each  example  has 
seven  features  which  represent  the  letters  in 
the  input  window.  Each  feature  can  take 
on  one  of  27  values.  There  is  a  value  cor¬ 
responding  to  each  letter  of  the  alphabet, 
and  a  value  to  represent  the  absence  of  a 
letter. Since each example is formed from
only a single word, when the window
overhangs a word, the overhanging window
positions are set to the “space” value. In the
original  Nettalk  domain,  the  task  involved 
predicting  both  a  phoneme  and  a  stress  for 
each  window  position.  In  our  experiments 
we  have  simplified  the  problem  so  that  the 
classifiers  are  trained  only  to  predict  a  stress 
(out  of  five  disjoint  classes). 
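
The construction of examples from a single word can be sketched in Python as below. The sketch assumes that the seven-letter window is centred on the letter whose stress is being predicted, which the text above does not state explicitly; the function name is hypothetical.

```python
def word_windows(word, width=7, pad=" "):
    """Return one fixed-length window per letter of `word`.
    Positions that overhang the word are filled with the 'space' value.
    Assumption: the window is centred on the letter of interest."""
    half = width // 2
    padded = pad * half + word + pad * half
    return [padded[i:i + width] for i in range(len(word))]
```

For the word "cat" this gives the windows `"   cat "`, `"  cat  "` and `" cat   "`, one example per letter.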

5  Experimental  Results 

In  this  section  we  evaluate  our  approach 
to rule extraction by comparing the accuracy
and comprehensibility of rules extracted
from  neural  networks  and  rules  learned  us¬ 
ing  the  C4.5  algorithm.  The  comprehensi¬ 
bility  of  a  set  of  rules  is  a  difficult  concept  to 
measure.  We  simply  measure  the  syntactic 
complexity  of  the  rule  sets  and  use  this  as 
a proxy for comprehensibility. Specifically, we
consider the number of rules and antecedents to
be  measures  of  syntactic  complexity. 

5.1  The  Promoter  Data  Set 

For  the  promoter  problem,  we  use  a  ten-fold 
cross-validation  methodology^  to  assess  the 
ability  of  our  approach  to  extract  accurate, 
comprehensible  rules  from  trained  networks. 
Our  reported  results  represent  averaged  val¬ 
ues  for  the  ten  runs. 

The  neural  networks  used  for  the  promoter 
domain  have  fully-connected  hidden  units  in 
a  single  layer.  The  number  of  hidden  units 
used  in  each  network  is  determined  by  cross- 
validation  within  the  training  set.  That  is, 
for  each  training  set,  networks  with  20,  25, 
10,  and  no  hidden  units  are  trained,  and 
cross-validation  is  used  to  pick  the  network 
that  is  to  be  trained  on  all  of  the  data  in 
the  training  set.  After  the  number  of  hid¬ 
den  units  is  selected  for  each  network,  a 
similar  cross-validation  procedure  is  used  to 
determine the $\lambda$ parameter for soft
weight-sharing. We use a conjugate-gradient
learning algorithm to train the weights and the
Gaussian  parameters  of  the  networks. 

Decision trees are induced, and rules
extracted from them, using Quinlan's C4.5
algorithm  (1993).  Cross-validation  within 
each  training  set  is  used  to  determine  the 
confidence  levels  for  both  tree  pruning  and 
rule  pruning.  The  confidence  level  se¬ 
lected  for  tree  pruning  does  not  affect  the 
rule  extraction  results  since  the  C4.5  rule- 
induction  program  operates  on  unpruned 
trees and performs its own pruning independently
of the tree-induction program. However,
we select a confidence level for tree
pruning only so that we obtain an accurate
estimate of decision tree generalization for
this task. For each training set, we test
confidence levels ranging from 5% to 95%
and separately select tree-pruning and
rule-pruning levels.

^In ten-fold cross-validation, the available data is
partitioned into ten sets. Classifiers are trained using
examples from nine of the sets and tested on examples from
the tenth set. This procedure is repeated ten times so
that each set is used as the testing set once.

Table 1: Generalization on the promoter data.

approach                  % test set error
C4.5   decision trees          16.9
C4.5   rules                   13.5
ANNs   networks                 7.9
ANNs   rules                   11.1

Table 2: Rule-set sizes for the promoter data.

approach    # rules    # antecedents
C4.5          23.2           47.3
ANNs           8.2          119.6

Table  1  shows  the  test  set  error  rates  on 
the  promoter  data  set  for  the  decision  trees, 
rules  extracted  from  the  trees,  neural  net¬ 
works,  and  rules  extracted  from  the  net¬ 
works.  As  can  be  seen  in  the  table,  neu¬ 
ral  networks  perform  significantly  better  on 
this task than decision trees or the rules
extracted from them. Additionally, the
performance of the symbolic rules extracted from
the  neural  networks  is  fairly  close  to  the  per¬ 
formance  of  the  networks  themselves,  and 
better  than  the  rules  extracted  from  the  de¬ 
cision  trees.  The  difference  in  error  rates  be¬ 
tween  the  rules  extracted  from  networks  and 
the  C4.5  rules  is  significant  at  the  0.05  level 
using  a  paired,  1-tailed  t-test. 
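
The test statistic for a paired t-test over per-fold error rates can be sketched in Python as below. This is a generic textbook computation, not the authors' analysis code, and the function name is hypothetical. With ten folds there are nine degrees of freedom, for which the one-tailed critical value at the 0.05 level is 1.833.

```python
import math

def paired_t_statistic(errors_a, errors_b):
    """t statistic for paired samples, e.g. per-fold error rates of two
    methods measured on the same cross-validation folds."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)

# One-tailed critical value of Student's t at the 0.05 level, 9 degrees of freedom.
T_CRITICAL_DF9_05 = 1.833
```

The difference is judged significant when the statistic computed from the ten paired fold results exceeds the critical value.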

Table  2  shows  the  average  number  of 
rules  and  antecedents  for  the  rules  extracted 
from  our  networks  and  the  rules  induced  by 
C4.5.  The  rule  sets  extracted  from  the  de¬ 
cision  trees  contain  more  rules  but  fewer  an¬ 
tecedents  than  the  sets  extracted  from  net¬ 
works.  The  additional  complexity  of  the 
rules extracted from networks, however,
results in a significant gain in accuracy.
Moreover, the rules extracted from networks have
only  14.6  antecedents  per  rule,  so  we  believe 
that  their  complexity  is  within  the  bounds 
of  what  biologists  can  readily  understand. 

Table 3 shows a set of rules extracted
from one of the promoter networks. In
addition to the NofM-style rules, this rule set
has been expressed using a predicate we call
more_than. The more_than predicate has
the following form:

N more_than(Pos_Set, Neg_Set)


where N is an integer, and Pos_Set and
Neg_Set are sets of positive and negated
antecedents respectively. The predicate
returns true if the number of true antecedents
in Pos_Set minus the number of true
antecedents in Neg_Set is greater than N.
This predicate provides a succinct way of
expressing rules that have many negated
antecedents. Without such a predicate,
negated antecedents tend to result in a large
number of mostly-redundant rules. Since
the knowledge-based networks to which the
NofM algorithm was previously applied had
very few negated antecedents, this predicate
was not previously necessary.

The rule set shown in Table 3 exhibits
several interesting characteristics. First, the
rules abstract away a significant amount of
the complexity of the network from which
they are extracted. There are only ten rules
and a total of 106 antecedents. Six of the
hidden units and more than 2400 of the
weights that were present in the neural
network are not represented in the rules. A
second observation is that the rules focus on
what are known by biologists to be the most
significant regions of the DNA sequence.
In particular, a domain theory developed
by Michiel Noordewier (Towell et al., 1990)
identifies the -14 to -7 and the -37 to -31
regions as containing the most important
features of a promoter. These are termed the
contact regions. The rules extracted from all
of the hidden units specify antecedents
primarily in these areas.

Table 3: Rules extracted from a promoter
network. The predicate more_than returns true if the
number of true antecedents in the first set minus the
number of true antecedents in the second set is greater than
the supplied threshold. The notation @-36 indicates the
starting position of a given sequence; in this case the
position is 36 nucleotides before the start of a putative gene.
Dashes represent placeholders, so {@-36 ' — AG - A'}
has only three antecedents. The letter S is an ambiguity
code that biologists use to represent (C ∨ G).

promoter :-
    hidden_2,
    not(hidden_1),
    not(hidden_4).

promoter :-
    hidden_3,
    not(hidden_1),
    not(hidden_4).

promoter :-
    hidden_2,
    hidden_3,
    not 2 of {hidden_1, hidden_4}.

hidden_1 :-
    5 more_than( {@-40 'C - C-G-C-G',
                  @-13 '-SG - •-!‘G'},
                 {@-40 'A - T--A - ',
                  @-13 '-T - T'} ).

hidden_1 :-
    not({@-40 ' - T - '}),
    3 more_than( {@-40 'C - C-G-C-G',
                  @-13 '-SG - •-!‘G'},
                 {@-40 'A - T--A - ',
                  @-13 '-T - T'} ).

hidden_2 :-
    5 more_than( {@-40 'A - T-GA-A',
                  @-13 '-T-'},
                 {@-40 'C - C-G--',
                  @-13 '-SG'} ).

hidden_2 :-
    {@-40 ' - T - '},
    3 more_than( {@-40 'A - T-GA-A',
                  @-13 '-T-'},
                 {@-40 'C - C-G--',
                  @-13 '-SG'} ).

hidden_3 :-
    4 more_than( {@-40 'A - T-GA-AT',
                  @-13 '-T-A-T'},
                 {@-40 ' - C-G-C-',
                  @-20 'G--G',
                  @-13 'GSGG'} ).

hidden_3 :-
    {@-40 ' - T - '},
    2 more_than( {@-40 'A - T-GA-AT',
                  @-13 '-T-A-T'},
                 {@-40 ' - C-G-C-',
                  @-20 'G--G',
                  @-13 'GSGG'} ).

hidden_4 :-
    4 of {@-44 'G', @-40 ' - AC-G-C',
          @-24 'A--G', @-13 '-SG'}.
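
The more_than predicate used in Table 3, together with the expansion of a sequence pattern into individual antecedents, can be sketched in Python as follows. The function names, the dictionary representation of a window (position label to nucleotide), and the example pattern are assumptions for illustration. The sketch relies on the labelling described earlier, in which position -1 is followed directly by position 1.

```python
AMBIGUITY = {"S": "CG"}  # S stands for (C or G)

def pattern_antecedents(start, pattern, window):
    """Return one truth value per non-dash letter of `pattern`.
    `window` maps position labels (-50..-1, 1..7) to nucleotides;
    dashes are placeholders and contribute no antecedent."""
    results = []
    pos = start
    for ch in pattern:
        if ch != "-":
            allowed = AMBIGUITY.get(ch, ch)
            results.append(window.get(pos, "?") in allowed)
        pos += 1
        if pos == 0:  # there is no position 0 in the labelling
            pos = 1
    return results

def more_than(n, pos_set, neg_set):
    """True if (# true in pos_set) - (# true in neg_set) is greater than n."""
    return sum(map(bool, pos_set)) - sum(map(bool, neg_set)) > n
```

A rule body such as `2 more_than(P, Q)` is then evaluated by concatenating the antecedent lists of the patterns in each set and calling `more_than(2, P, Q)`.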

5.2  The  Nettalk  Data  Set 

For  the  Nettalk  domain,  classifiers  are 
trained on half of the example set and tested
on the other half. Ten runs of this procedure
are performed, and the reported results
represent averaged values. As with the promoter
domain, cross-validation is used to
determine the $\lambda$ parameter and the number of
hidden  units  for  the  networks,  and  the  confi¬ 
dence  levels  for  pruning  C4.5  trees  and  rules. 

Since the Nettalk domain involves five
classes,  the  rules  extracted  from  trained  net¬ 
works  are  not  necessarily  mutually  exclusive 
and  exhaustive.  In  other  words,  a  given  in¬ 
put  sequence  may  satisfy  more  than  one  of 
the  class  rules,  or  alternatively,  it  may  sat¬ 
isfy  none  of  the  class  rules.  The  C4.5  rule- 
extraction  method  also  faces  this  complica¬ 
tion  when  it  prunes  antecedents  and  rules 
from  its  rule  set.  C4.5  handles  this  problem 
in two ways: (1) rules are ordered by class,
and  the  first  rule  to  match  a  given  instance 
determines  the  predicted  class;  (2)  a  default 
rule  is  used  to  classify  instances  that  do  not 
satisfy  any  of  the  other  rules.  We  employ  a 
similar  policy  in  classifying  instances  using 
our  network-extracted  rules.  The  rules  are 
ordered  according  to  the  a  priori  probability 
of  each  class,  and  the  default  rule  predicts 
the  most  probable  class. 
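
This classification policy can be sketched in Python as below. The names are hypothetical, rules are represented as plain predicates over an instance, and the sketch assumes that classes are tried from most to least probable, a direction the text does not specify.

```python
def order_rules_by_prior(class_rules, priors):
    """Flatten per-class rule lists into one ordered list.
    Assumption: classes with higher a priori probability come first."""
    ordered = []
    for cls in sorted(class_rules, key=lambda c: priors[c], reverse=True):
        ordered.extend((cls, rule) for rule in class_rules[cls])
    return ordered

def classify(instance, ordered_rules, default_class):
    """The first matching rule decides the class; otherwise use the default."""
    for cls, rule in ordered_rules:
        if rule(instance):
            return cls
    return default_class
```

The default class would be chosen as `max(priors, key=priors.get)`, the most probable class.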

Table  4  shows  the  test  set  error  rates  on 
the  Nettalk  data  set  for  the  decision  trees, 
rules  extracted  from  the  trees,  neural  net¬ 
works,  and  rules  extracted  from  the  net¬ 
works.  The  results  in  this  table  indicate  that 
the  neural  networks  and  the  rules  extracted 
from  them  outperform  C4.5  decision  trees 
and rules. The difference in error rates
between the rules extracted from networks and
the decision trees is significant at the 0.005
level using a paired, 1-tailed t-test.

Table 4: Generalization on the Nettalk data.

approach                  % test set error
C4.5   decision trees          19.1
C4.5   rules                   20.1
ANNs   networks                13.0
ANNs   rules                   17.0

Table 5: Rule-set sizes for the Nettalk data.

approach    # rules    # antecedents
C4.5         233.5          466.5
ANNs          17.5          661.9

Table  5  shows  the  average  number  of  rules 
and  antecedents  in  the  extracted  rule  sets. 
The rule sets extracted from the neural
networks contain far fewer rules than the rules
generated  from  the  decision  trees,  although 
the  network  rules  have  far  more  antecedents. 

6  Conclusions 

We  have  demonstrated  that  small  sets  of 
accurate,  reasonably  concise  symbolic  rules 
can  be  extracted  from  ordinary  artificial  neu¬ 
ral  networks.  Our  approach  to  this  prob¬ 
lem  involves  exploiting  the  effectiveness  of 
the  NofM  algorithm  by  encouraging  weight 
clustering  during  training.  For  two  difficult 
problem  domains,  recognizing  promoters  in 
DNA,  and  mapping  English  text  to  stress 
patterns,  our  approach  was  able  to  induce 
rules  that  resulted  in  better  generalization 
than  rules  learned  using  the  C4.5  symbolic 
learning  algorithm. 

There are a number of issues regarding
our  approach  that  we  plan  to  pursue  in  fur¬ 
ther  research.  One  such  issue  is  adapting 
the  approach  so  that  it  can  extract  concise 
rule  sets  from  networks  that  have  learned 
distributed representations. In a distributed
representation,  each  concept  at  the  hidden- 
unit  level  may  be  encoded  by  the  activations 
of  many  hidden  units,  and  each  unit  may 
play  a  part  in  representing  many  different 
concepts.  Distributed  representations  tend 
to  result  in  rule  sets  that  are  verbose  and  dif¬ 
ficult  to  understand.  The  NofM  algorithm 
makes  the  assumption  that  each  hidden  unit 
corresponds  to  a  meaningful  concept  (which 
is an appropriate assumption for knowledge-based
networks), and thus it searches for
rules  by  considering  each  hidden  unit  inde¬ 
pendently.  Our  proposed  approach  involves 
partitioning  the  space  of  hidden  unit  activa¬ 
tions  and  then  searching  for  rules  that  ex¬ 
plain  particular  regions  of  this  space. 

A  second  area  that  we  plan  to  investigate 
in  future  research  is  to  employ  a  weight- 
pruning  method,  such  as  Optimal  Brain 
Damage  (LeCun  et  al.,  1990),  during  learn¬ 
ing.  The  effectiveness  of  the  NofM  al¬ 
gorithm  is  partly  due  to  the  weight  prun¬ 
ing  that  it  performs  during  its  clustering 
step.  The  expected  advantage  of  pruning 
weights  during  the  learning  process  is  that 
the  remaining  weights  are  able  to  adapt  to 
the  changes  imparted  by  the  pruning  opera¬ 
tion.  An  additional  advantage  of  the  Op¬ 
timal  Brain  Damage  technique  is  that  it 
does  not  base  its  pruning  decisions  on  weight 
magnitudes, but instead bases them on the
second partial derivative of the cost function
with respect to each weight. The sensitivity
of  the  cost  function  to  a  given  weight  is  bet¬ 
ter  estimated  by  a  second  derivative  than  by 
a  magnitude. 

Another area for future research is to
investigate the range of domains to which our
approach can be successfully applied.

Extracting  accurate,  comprehensible  rules 
from  neural  networks  is  an  important  prob¬ 
lem  in  machine  learning.  We  have  described 
an approach that employs the NofM algorithm
and soft weight-sharing, and demonstrated
that it is able to extract accurate,
comprehensible  rules  from  networks  trained 
on  a  difficult  real-world  problem.  These 
promising  results  indicate  that  the  problem 
of  understanding  representations  learned  by 
artificial  neural  networks  may  be  tractable. 
Abstract

The  problem  of  designing  and  refining  task- 
level  strategies  in  an  embedded  multiagent 
setting  is  an  important  unsolved  question.  To 
address  this  problem,  we  have  developed  a 
multistrategy  system  that  combines  two 
learning  methods:  operationalization  of 
high-level  advice  provided  by  a  human  and 
incremental  refinement  by  a  genetic  algo¬ 
rithm.  The  first  method  generates  seed  rules 
for  finer-grained  refinements  by  the  genetic 
algorithm.  Our  multistrategy  learning  system 
is  evaluated  on  two  complex  simulated 
domains  as  well  as  with  a  Nomad  200  robot. 

Key  words:  advice,  operationalize, 
genetic  algorithms 

1  Introduction 

The  problem  of  designing  and  refining  task- 
level  strategies  in  an  embedded  multi-agent 
setting  is  an  important  unsolved  question.  To 
address  this  problem,  we  have  developed  a 
multistrategy  learning  system  that  combines 
two  learning  methods:  operationalization  of 
high-level  advice  provided  by  a  human,  and 
incremental  refinement  by  a  genetic  algo¬ 
rithm  (GA).  We  define  advice  as  a 
recommendation to achieve a goal under
certain conditions. Advice is considered to be
operationalized when it is translated into
stimulus-response rules in a language directly
usable  by  the  agent.  Operationalization  gen¬ 
erates  seed  rules  for  finer-grained 
refinements  by  a  GA. 

The  long  term  goal  of  the  work  proposed  here 
is  to  develop  task-directed  agents  capable  of 
acting,  planning,  and  learning  in  worlds  about 
which  they  have  incomplete  information. 
These  agents  refine  factual  knowledge  of  the 
world  they  inhabit,  as  well  as  strategic 
knowledge  for  achieving  their  tasks,  by 
interacting  with  the  world.  Agent  knowledge 
acquisition is very difficult for the same
reasons that knowledge acquisition for expert
systems is. It is preferable to assimilate
high-level knowledge because entering
low-level domain-specific knowledge for an
agent is costly, tedious, and error-prone.
The additional challenge for agent knowledge
acquisition comes from the fact that the
agent must dynamically
update  its  knowledge  through  interactions 
with  its  environment. 

There  are  two  basic  approaches  to  construct¬ 
ing  agents  for  dynamic  environments.  The 
first  decomposes  the  design  into  stages:  a 
parametric design followed by refinement of
the  parameter  values  using  feedback  from  the 
world  in  the  context  of  the  task.  Several 
refinement  strategies  have  been  studied  in  the 
literature:  GAs  (Odetayo  and  McGregor, 
1989),  neural-net  learning  (Clouse  and 
Utgoff,  1992),  statistical  learning  (Maes  and 
Brooks,  1990),  and  reinforcement  learning 
(Mahadevan  and  Connell,  1991).  The 
second, more ambitious, approach (Grefenstette
et al., 1990; Tesauro, 1992) is to
acquire  the  agent  knowledge  directly  from 
example  interactions  with  the  environment. 
The  success  of  this  approach  is  tied  to  the 
efficacy  of  the  credit  assignment  procedures, 
and  whether  or  not  it  is  possible  to  obtain 
good  training  runs  with  a  knowledge- 
impoverished  agent. 

We  have  adopted  the  first  approach.  The 
direction we pursue is to compile an initial
parametric  agent  using  high-level  strategic 
knowledge  (e.g.,  advice)  input  by  the  user,  as 
well  as  a  body  of  general  (not  domain- 
specific)  spatial  knowledge  in  the  form  of  a 
Spatial  Knowledge  Base  (SKB).  The  SKB 
contains  qualitative  rules  about  movement  in 
space.  Example  rules  in  our  SKB  are  "If 
something  is  on  my  side,  and  I  turn  to  the 
other  side,  I  will  not  be  facing  it"  and  "If  I 
move  toward  something  it  will  get  closer". 
This  SKB  is  portable  because  it  is  applicable 
to a variety of domains where qualitative
spatial knowledge is important. A similar
qualitative knowledge base was constructed by
Mitchell (1987) for the task of pushing
objects  in  a  plane.  Since  the  knowledge  pro¬ 
vided  to  our  agent  will  often  be  imperfect 
(incomplete  and  incorrect),  this  knowledge  is 
refined  by  a  GA. 

First we describe our deductive
operationalization process and the nature of the
parameterization adopted for our agent. Then we
describe the inductive (GA) refinement stage
and  compare  our  multistrategy  approach  with 
one  that  is  purely  inductive.  Before  we 
present  the  details  of  the  method,  we  charac¬ 
terize  the  class  of  environments  and  tasks  for 
which  we  have  found  this  decomposition  (of 
an  agent  design  into  an  initial  parametric 
stage  and  subsequent  refinement  stage)  to  be 
effective. 

•  Environment  characteristics:  Complete 
models  of  the  dynamics  of  the  environment 
in  the  form  of  differential  equations  or 
difference  equations,  or  discrete  models 
like  STRIPS  operators,  are  unavailable. 
An  analytical  design  that  maps  the  percepts 
of  an  agent  to  its  actions  (e.g.,  using 
differential  game  theory  or  control  theory) 
in  these  domains  is  not  possible  without  a 
complete  model.  Even  if  a  model  were 
available,  standard  methods  for  deriving 
agents  are  extensional  and  involve  explora¬ 
tion  of  the  entire  state  space.  They  fail 
because  the  domains  considered  in  this 
paper have on the order of 100 million
states. 

•  Task characteristics: Tasks are sequential
decision  problems:  payoff  is  obtained  at  the 
end  of  a  sequence  of  actions  and  not  after 
individual  actions.  Examples  are  pursuit- 
evasion  in  a  single  or  multi-pursuer  setting 
and  navigating  in  a  world  with  moving  obs¬ 
tacles.  The  tasks  are  typically  multi¬ 
objective  in  nature:  for  instance,  minimize 
energy  consumption  while  maximizing  the 
time  till  capture  by  the  pursuer. 

•  Agent  characteristics:  The  agent  has 
imperfect  sensors.  Imperfections  occur  in 
the  form  of  noise,  as  well  as  incomplete¬ 
ness  (all  aspects  of  the  state  of  the  world 
cannot  be  sensed  by  our  agent,  a  problem 
called  perceptual  aliasing  in  Whitehead 
and  Ballard,  1990).  Stochastic  differential 
game  theory  has  methods  for  deriving 
agents with noisy sensors, but it requires
detailed  models  of  the  noise  as  well  as  a 
detailed  model  of  the  environment  and 
agent  dynamics. 

The  action  set  of  the  agents  and  the  values 
taken  on  by  sensors  are  discrete  and  can  be 
grouped  into  qualitative  equivalence 
classes.  This  is  the  basis  for  the  design  of 
the  parametric  agent.  A  similar  intuition 
underlies  the  design  of  fuzzy  controllers 
that  divide  the  space  of  sensor  values  into  a 
small  set  of  classes  described  by  linguistic 
variables. 

In  such  domains,  human  designers  derive  an 
initial  solution  by  hand  and  use  numerical 
methods  (typically  very  dependent  on  the  ini¬ 
tial  solution)  to  refine  their  solution.  Our  ulti¬ 
mate  objective  is  to  automate  the  derivation 
of  good  initial  solutions  by  using  general 
knowledge  about  the  environment,  task,  and 
agent  characteristics  and  thus  provide  a  better 
starting  point  for  the  refinement  process.  We 
begin  with  the  SKB  and  advice. 

2  Compiling  Advice 

Our  operationalization  method  compiles 
high-level  domain-specific  knowledge  (e.g., 
advice)  and  spatial  knowledge  (SKB)  into 
low-level  reactive  rules  directly  usable  by  the 
agent.  The  compilation  performs  deductive 
concretion  (Michalski,  1991)  because  it 
deductively  converts  abstract  goals  and  other 
knowledge  into  concrete  actions.  An  impor¬ 
tant  question  is  why  we  adopt  a  deductive 
procedure  for  operationalization  of  advice. 
At  this  time,  we  are  able  to  achieve  opera¬ 
tionalization  without  resorting  to  any  form  of 
non-deductive  inference.  This  is  because  for 
the  domains  studied  in  this  paper,  the  SKB  is 
complete  enough  to  yield  good  parametric 
designs  with  deductive  inference  alone.  We 
expect  that  as  we  expand  our  experimental 
studies to cover more domains, the
incompleteness of the SKB will force us to
adopt more powerful operationalization
methods.

We  assume  all  the  knowledge  provided  is 
operationalized immediately; however, it
need  not  be  applied  immediately.  We 
precompile  all  high-level  knowledge  because 
the  agent  will  apply  it  in  a  time-critical  situa¬ 
tion.  The  learning  cost  prior  to  execution  is 
not  a  concern,  but  the  reaction  time  of  the 
agent  is  critical.  Therefore  it  is  best  to  have 
all  knowledge  in  a  quickly  usable  (opera¬ 
tional)  form.  Compiled  rules  are  fully  opera¬ 
tional,  whereas  advice  and  SKB  rules  have  at 
least  some  nonoperational  elements.  The 
user  specifies  what  is  operational  for  the 
agent. 

Compilation uses two stacks: a GoalStack
and an (operational condition) OpCondStack.
Three  types  of  knowledge  are  initially  given 
to  the  compiler:  facts,  nonoperational  rules 
(abbreviated  nonop  rules),  and  advice.  A 
user  can  provide  any  of  the  three.  The  SKB 
has  only  facts  and  nonop  rules.  The  output 
from  compilation  is  a  set  of  op  rules  directly 
usable  (i.e.,  operational)  by  the  agent.  Facts 
have  the  form: 

Predicate(X1, ..., Xn).

Nonop  rules  have  the  form: 

IF  cond  AND  ...  AND  cond  <AND  action> 
THEN  goal. 

Anything  in  angle  brackets  is  optional.  The 
portion  preceding  the  "THEN"  is  the  rule 
antecedent,  and  the  portion  following  the 
"THEN"  is  the  rule  consequent.  A  nonop  rule 
consequent  is  a  single  goal.  The  syntax  for  a 
goal is "function(X1, ..., Xn) = value" or
"predicate(X1, ..., Xn)". Each Xi is an object
(e.g., an agent). The syntax for a "cond"
(condition)  or  "action"  in  the  rule  antecedent 
is the same as for goals. Advice has the
form:

<IF cond AND ... AND cond THEN>
ACHIEVE  goal. 

Although  advice  has  a  similar  syntax  to 
nonop  rules,  its  interpretation  differs.  Advice 
recommends  achieving  the  given  "goal" 
under  the  given  "conds".  A  nonop  rule,  on  the 
other  hand,  states  that  the  given  "goal"  will 
be  achieved  if  the  given  "conds"  (and 
"action")  occur.  Compilation  results  in 
stimulus-response  op  rules  of  the  form: 

IF  cond  AND ...  AND  cond  THEN  action. 

The  conditions  of  an  op  rule  are  sensor  values 
detectable  by  the  agent.  The  action  can  be 
performed  by  the  agent.  Our  compilation 
algorithm  is  in  Figure  1. 

Push advice on GoalStack: goal followed by conditions.
Initialize OpCondStack to be empty and invoke
Compile(GoalStack, OpCondStack).

Procedure Compile(GoalStack, OpCondStack)
  IF GoalStack is not empty THEN
    g ← pop(GoalStack);
    CASE g:
      1. g is an operational condition:
           Push(g, OpCondStack);
           Compile(GoalStack, OpCondStack)
      2. g matches a fact:
           Compile(GoalStack, OpCondStack)
      3. g is nonoperational:
           FOREACH nonop rule Ri in knowledge base
           whose consequent matches g DO
             Push(antecedent(Ri), GoalStack);
             Compile(GoalStack, OpCondStack) *
      4. g is an operational action:
           Form a new op rule from the contents
           of OpCondStack and g;
           Clear OpCondStack
  ELSE Clear OpCondStack

FIG. 1. Algorithm for operationalizing advice.

* To prevent cycles, the last nonop rule used in step 3 is
marked as "used" so that it will not be used again.


This  algorithm  takes  advice  and  backchains 
through  the  SKB  and  user-provided  nonop 
rules  until  an  operational  action  is  found. 
Once  an  operational  action  is  found,  it  pops 
back  up  the  levels  of  recursion,  attaching 
conditions along the way, to form a new reactive
agent op rule.
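
The backchaining of Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: goals are ground strings matched by equality (no unification), each alternative nonop rule receives its own copy of the stacks so that conditions collected before the branch are kept, and the names NonopRule, OpRule, and compile_advice are hypothetical. The data encode the heading/bearing example discussed in the text.

```python
from typing import List, NamedTuple, Set, Tuple


class NonopRule(NamedTuple):
    antecedent: Tuple[str, ...]  # conditions first, optional action last
    consequent: str              # a single goal


class OpRule(NamedTuple):
    conditions: Tuple[str, ...]
    action: str


def compile_advice(goal: str, conds: List[str], rules: List[NonopRule],
                   facts: Set[str], op_conds: Set[str],
                   op_actions: Set[str]) -> List[OpRule]:
    """Backchain from one piece of advice to stimulus-response op rules."""
    out: List[OpRule] = []

    def compile_(goal_stack: List[str], op_cond_stack: List[str],
                 used: frozenset) -> None:
        if not goal_stack:
            return  # nothing left to operationalize on this branch
        g, rest = goal_stack[-1], goal_stack[:-1]  # top of stack = list end
        if g in op_conds:        # case 1: operational condition
            compile_(rest, op_cond_stack + [g], used)
        elif g in facts:         # case 2: g matches a fact
            compile_(rest, op_cond_stack, used)
        elif g in op_actions:    # case 4: operational action ends the branch
            out.append(OpRule(tuple(op_cond_stack), g))
        else:                    # case 3: nonoperational goal
            for i, rule in enumerate(rules):
                if rule.consequent == g and i not in used:
                    # Push the antecedent so its first condition is on top.
                    pushed = rest + list(reversed(rule.antecedent))
                    compile_(pushed, op_cond_stack, used | {i})

    # The goal is pushed first, then the conditions, so conditions pop first.
    compile_([goal] + list(reversed(conds)), [], frozenset())
    return out


GOAL = "heading(adversary, agent) = not(head-on)"
skb = [
    NonopRule(("bearing(agent, adversary) = right", "turn(agent) = left"), GOAL),
    NonopRule(("bearing(agent, adversary) = left", "turn(agent) = right"), GOAL),
]
op_rules = compile_advice(
    goal=GOAL,
    conds=["speed(adversary) = low"],
    rules=skb,
    facts=set(),
    op_conds={"speed(adversary) = low",
              "bearing(agent, adversary) = right",
              "bearing(agent, adversary) = left"},
    op_actions={"turn(agent) = left", "turn(agent) = right"},
)
for r in op_rules:
    print("IF", " AND ".join(r.conditions), "THEN", r.action)
```

Passing copies of the stacks down each branch is a design choice of this sketch; it keeps the advice condition attached to every op rule derived from the same advice, which is the behavior the worked example describes.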

Let us examine a simple example of how this
algorithm operates, as shown in Figures 2 and
3. Heading(X,Y) refers to the direction of
motion of Y relative to X, and bearing(X,Y)
refers to the direction of Y relative to X. Suppose
the advice is "IF speed(adversary) = low
THEN ACHIEVE heading(adversary, agent)
= not(head-on)" (i.e., avoid adversary). Figure
2 shows how SKB nonop rules match this
advice  for  backchaining,  thereby  creating  an 
"and"  tree.  Anything  preceded  by  a  is 
operational. 

Figure 3 shows this algorithm in operation.
Note that stacks grow downward. The algorithm
begins by pushing the advice goal, followed
by the advice condition, on the GoalStack.
It then calls procedure Compile,
which moves the advice condition to the
OpCondStack because it is operational. The
advice goal is not operational. In our example,
the advice goal can be unified with the
goal of SKB RULE1, which states "IF
bearing(agent, adversary) = right AND
turn(agent) = left THEN heading(adversary,
agent) = not(head-on)". The condition and
action of RULE1 are pushed on the GoalStack.
Because the condition of RULE1 is
operational, it is moved to the OpCondStack.

At this point, the action of RULE1 is at the
top of the GoalStack, and it is operational, so
we can create an op rule. The conditions
from the OpCondStack are added to the
action. This creates an op rule that states "IF
speed(adversary) = low AND bearing(agent,
adversary) = right THEN turn(agent) = left".
Both stacks are cleared. The algorithm continues
similarly to generate a second op rule
that states "IF speed(adversary) = low AND
bearing(agent, adversary) = left THEN
turn(agent) = right" from SKB RULE2 (see
Figure 2).

FIG. 2. Graph of example.


FIG. 3. Example of the compilation algorithm.

Next, we apply a conversion from qualitative
to quantitative op rules. The rules are given
default quantitative ranges. For example, if
"speed" has two values, "slow" and "fast", we
bisect the range of all possible values into
two subranges. Then, we allow the system to
improve this initial choice of quantitative
ranges by using a GA to refine the initial
ranges while interacting with the environment.
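
The default qualitative-to-quantitative conversion can be illustrated with a short Python sketch. It assumes equal-width subranges (which reduces to bisection for two values); the function name default_ranges and the sensor bounds 0 to 1000 are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def default_ranges(values: List[str], low: float,
                   high: float) -> Dict[str, Tuple[float, float]]:
    """Split [low, high] into equal-width subranges, one per qualitative value."""
    width = (high - low) / len(values)
    return {v: (low + i * width, low + (i + 1) * width)
            for i, v in enumerate(values)}


# Two qualitative values bisect the (assumed) legal range of the sensor.
speed_ranges = default_ranges(["slow", "fast"], 0.0, 1000.0)
print(speed_ranges)
```

These defaults are only a starting point; in the approach described here the GA is expected to adjust the range boundaries during refinement.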

3  Executing  and  Refining  Advice 

The  system  we  use  to  refine  and  apply  the  op 
rules  derived  from  our  compiled  advice  is  the 
SAMUEL  reactive  planner  (Grefenstette 
et  al.,  1990).  We  have  chosen  SAMUEL 
because  this  system  has  already  proven  to  be 
highly  effective  for  refining  rules  on  complex 
domains (Grefenstette et al., 1990; Schultz
and Grefenstette, 1990). SAMUEL adopts
the role of an agent in a multiagent environment
in which it senses and acts. This system
has  two  major  components:  a  performance 
module  and  a  learning  module.  Section  4.2 
explains  how  performance  interleaves  with 
learning  in  our  experiments. 

The performance module, called the Competitive
Production System (CPS), interacts
with a simulated or real world by reading
sensors, setting effector values, and receiving
payoff from a critic. CPS performs matching
and conflict resolution on the set of op rules.
This performance module follows the
match/conflict-resolution/act cycle of traditional
production systems. Time is divided
into episodes: the choice of what constitutes
an episode is domain-specific. Episodes
begin with random initialization and end
when a critic provides payoff. At each time
step within an episode, CPS selects an action
using a probabilistic voting scheme based on
rule strengths. All rules that match (or partially
match - see Grefenstette et al., 1990)
the current state bid to have their actions fire.
The actions of rules with higher strengths are
more likely to fire. If the world is being
simulated, then after an action fires, the world
model is advanced one simulation step and
sensor readings are updated.
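
One plausible reading of the probabilistic voting scheme is sketched below in Python. It is an assumption-laden illustration, not SAMUEL's actual conflict resolution: each action bids the highest strength among the matching rules that suggest it, an action is then drawn with probability proportional to its bid, and partial matching is omitted. The names Rule and select_action and the example rules are hypothetical.

```python
import random
from typing import Callable, Dict, List, NamedTuple, Optional


class Rule(NamedTuple):
    matches: Callable[[Dict[str, float]], bool]  # test on sensor readings
    action: str
    strength: float                              # assumed positive


def select_action(rules: List[Rule], sensors: Dict[str, float],
                  rng: random.Random) -> Optional[str]:
    """Pick an action with probability proportional to its strongest bid."""
    bids: Dict[str, float] = {}
    for rule in rules:
        if rule.matches(sensors):
            bids[rule.action] = max(bids.get(rule.action, 0.0), rule.strength)
    if not bids:
        return None  # no rule matches the current state
    actions = list(bids)
    return rng.choices(actions, weights=[bids[a] for a in actions], k=1)[0]


rules = [
    Rule(lambda s: s["range"] <= 750, "hard-left", 0.9),
    Rule(lambda s: True, "straight", 0.1),  # plays the role of a catch-all rule
]
rng = random.Random(0)
picks = [select_action(rules, {"range": 400.0}, rng) for _ in range(1000)]
print(picks.count("hard-left"), picks.count("straight"))
```

Because selection is stochastic, weak rules still fire occasionally, which gives the credit assignment step a chance to revise their strengths.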


CPS assigns credit to individual rules based
on feedback from the critic. At the end of
each  episode,  all  rules  that  suggested  actions 
taken  during  this  episode  have  their  strengths 
incrementally  adjusted  to  reflect  the  current 
payoff.  Over  time,  rule  strengths  reflect  the 
degree  of  usefulness  of  the  rules. 
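
The incremental adjustment of rule strengths can be pictured with the following Python sketch. It is a deliberate simplification and an assumption: each fired rule's strength is moved a fixed fraction of the way toward the episode payoff. The credit assignment actually used in SAMUEL is described in Grefenstette et al. (1990) and may differ; the names update_strength and credit_episode and the rate 0.1 are hypothetical.

```python
from typing import Dict, Set


def update_strength(strength: float, payoff: float, rate: float = 0.1) -> float:
    """Move a strength a fraction `rate` of the way toward the payoff."""
    return strength + rate * (payoff - strength)


def credit_episode(strengths: Dict[str, float], fired: Set[str],
                   payoff: float, rate: float = 0.1) -> Dict[str, float]:
    """Adjust only the rules whose suggested actions were taken this episode."""
    return {name: update_strength(s, payoff, rate) if name in fired else s
            for name, s in strengths.items()}


new_strengths = credit_episode({"r1": 1.0, "r2": 1.0}, {"r1"}, payoff=0.5)
print(new_strengths)
```

With repeated episodes, a rule that keeps firing in low-payoff episodes drifts toward a low strength, so it bids less successfully in later conflict resolution.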

SAMUEL’s learning module is a genetic
algorithm. GAs are motivated by simplified
models of heredity and evolution in the field
of population genetics (Holland, 1975). GAs
evolve a population of individuals over a
sequence of generations. Each individual
acts as an alternative solution to the problem
at hand, and its fitness (i.e., potential worth as
a solution) is regularly evaluated. During a
generation, individuals create offspring (new
individuals). The fitness of an individual
probabilistically determines how many offspring
it can have. Genetic operators, such as crossover
and mutation, are applied to the
offspring. Crossover combines elements of
two individuals to form new individuals;
mutation randomly alters elements of a single
individual. In SAMUEL, an individual is a
set of op rules. In addition to genetic operators,
this system also applies non-genetic
knowledge refinement operators, such as
"generalize" and "specialize", to op rules
within a rule set.

The interface between our compilation algorithm
and the SAMUEL system is straightforward.
The output of our compilation algorithm
is a set of op rules for the SAMUEL
agent. Because the op rules may be incomplete,
a random rule is added to this rule set.
The random rule recommends performing a
random action under any conditions. This
rule set, along with CPS and the GA learning
component for improving the rules, is our initial
agent.

4  Evaluation

We have not yet analyzed the cost of our
compilation algorithm. The worst case cost
appears to be exponential, because the
STRIPS planning problem (which is
PSPACE-complete) can be reduced to it. In the future,
we plan to investigate methods to reduce this
cost for complex realistic problems. Potential
methods include: (1) attaching a likelihood
of occurrence onto advice, which
enables the agent to prioritize which advice
to compile first if time is limited, (2) tailoring
the levels of generality and abstraction of the
advice to suit the time available for compilation
(e.g., less abstract advice is closer to
being operational and therefore requires less
compilation time), and (3) generating a parallel
version of the algorithm.

We  have  evaluated  our  multistrategy 
approach empirically. We focus on answering
the following questions:

•  Will  our  advice  compilation  method  be 
effective  for  a  reactive  agent  on  complex 
domains? 

•  Will  the  coordination  of  multiple  learning 
techniques  lead  to  improved  performance 
over using any one learning method? In particular,
we want the GA to improve the success
rate of the compiled advice, and the
advice  to  improve  the  convergence  rate  of 
the  GA.  An  improved  convergence  rate  is 
useful  when  learning  time  is  limited. 

•  Can  we  construct  a  portable  SKB? 

4.1  Domain  characteristics 

To address our questions, we have run experiments
on two complex problems: Evasion
and Navigation. Our choice of domains is
motivated by the results of Schultz and Grefenstette
(1990), who have obtained large
performance improvements by initializing the
GA component of SAMUEL with hand-coded
op rules in these domains. Their success
has inspired the work described here.
Our objective is to automate their tedious
manual task, and the work described here is
one step toward our goal.

Both problems are two-dimensional simulations
of realistic tactical problems. However,
our simulations include several features that
make these two problems sufficiently complex
to cause difficulties for more traditional
control theoretic or game theoretic
approaches (Grefenstette et al., 1990):

•  A  weak  domain  model.  The  learner  has  no 
initial  model  of  other  agents  or  objects  in 
the  domain.  Most  control  theoretic  and 
game  theoretic  models  make  worst  case 
assumptions  about  adversaries.  This  yields 
poor  designs  in  the  worlds  we  consider 
because  we  have  statistical  rather  than 
worst  case  adversaries. 

•  Incomplete  state  information.  The  sensors 
are  discrete,  which  causes  a  many-to-one 
mapping  and  perceptual  aliasing. 

•  A  large  state  space.  The  discretization  of 
state  space  makes  the  learning  problem 
combinatorial.  In  the  Evasion  domain,  for 
instance,  over  25  million  distinct  feature 
vectors  are  observed,  each  requiring  one  of 
nine  possible  actions,  giving  a  total  of  over 
225  million  maximally  specific  condition- 
action  pairs. 

•  Delayed  payoff.  The  critic  only  provides 
payoff  at  the  end  of  an  episode.  Therefore 
a  credit  assignment  scheme  is  required. 

•  Noisy sensors. Gaussian noise is added to
all sensor readings. Noise consists of a
random draw from a normal distribution
with mean 0.0 and standard deviation equal
to 5% of the legal range for the
corresponding sensor. The value that
results is discretized according to the
defined granularity of the sensor. A 5%
noise level is sufficient to degrade
SAMUEL’s performance.
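
The sensor noise model in the last bullet can be sketched in Python as follows. The standard deviation (5% of the legal range) and the discretization step come from the text; rounding to the nearest multiple of the granularity and clamping to the legal range are assumptions of this sketch, as are the function name noisy_reading and the example bounds.

```python
import random


def noisy_reading(true_value: float, low: float, high: float,
                  granularity: float, rng: random.Random,
                  noise_fraction: float = 0.05) -> float:
    """Add zero-mean Gaussian noise, then discretize to the sensor granularity."""
    sigma = noise_fraction * (high - low)       # 5% of the legal range
    noisy = true_value + rng.gauss(0.0, sigma)
    noisy = min(max(noisy, low), high)          # assumption: clamp to legal range
    steps = round((noisy - low) / granularity)  # assumption: round to nearest step
    return low + steps * granularity


rng = random.Random(0)
# Hypothetical range sensor with legal values 0..1000 in steps of 50.
readings = [noisy_reading(400.0, 0.0, 1000.0, 50.0, rng) for _ in range(5)]
print(readings)
```

Because many true values map to the same discretized reading, the sketch also shows why discrete sensors produce the many-to-one mapping mentioned under incomplete state information.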

4.2  Experimental  design 

Two  sets  of  experiments  are  performed  on 
each  of  the  two  domains.  Perception  is 
noise-free  for  the  first  set,  but  noisy  for  the 
second.  The  primary  purpose  of  the  first  set 
is  to  address  our  question  about  the 
effectiveness  of  our  advice  compilation 
method  alone,  without  GA  refinement.  Facts, 
nonop rules, advice, and the random rule are
given to the compiler and the output is a set of
op rules. This rule set is given to SAMUEL’s
CPS module and applied within the simulated
world model. The baseline performance with
which these rules are compared is the random
rule alone. These experiments measure how
the average (over 1000 episodes) success rate
of the compiled rules compares with that of
the baseline as problem complexity increases.
Statistical significance of the differences
between the curves with and without advice
is presented. Significance is measured
using  the  large-sample  test  for  the  differences 
between  two  means. 
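
For reference, the large-sample test for the difference between two means can be written as a short Python sketch. It assumes each episode is a success/failure outcome, so the variance of a success rate p is p(1 - p); that modeling choice, the function names, and the rates in the example are assumptions for illustration and are not results from the experiments reported here.

```python
import math


def large_sample_z(mean1: float, var1: float, n1: int,
                   mean2: float, var2: float, n2: int) -> float:
    """z statistic for the difference between two sample means."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


def success_rate_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Same test when each observation is a 0/1 success indicator."""
    return large_sample_z(p1, p1 * (1 - p1), n1, p2, p2 * (1 - p2), n2)


# Hypothetical success rates over 1000 episodes each.
z = success_rate_z(0.40, 1000, 0.20, 1000)
significant = abs(z) > 1.96  # two-sided test at significance level 0.05
print(round(z, 2), significant)
```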

The  primary  purpose  of  the  second  set  of 
experiments  is  to  address  our  question  about 
the  effectiveness  of  the  multistrategy 
approach  (compilation  followed  by  GA 
refinement).  Facts,  nonop  rules,  and  advice 
are  given  to  the  compiler  and  the  output  is  a 
set of op rules. This rule set, plus the random
rule, becomes every individual in SAMUEL’s
initial GA population, i.e., it seeds the GA
with initial knowledge. The baseline performance
with which these rules are compared
is SAMUEL initialized with every individual
equal to just the random rule. In either case,
GA learning evolves this initial population.
In other words, we compare the performance
of advice seeding the GA with GA learning
alone (i.e., random seeding). Random seeding
produces an initially unbiased GA
search; advice initially biases the GA search
- hopefully into favorable regions of the
search space.

In this second set of experiments, performance
interleaves with GA refinement.
SAMUEL runs for 100 generations using a
population size of 100 rule sets. Every 5 generations,
the "best" (in terms of success rate)
10% of the current population are evaluated
over 100 episodes to choose a single plan to
represent the population. This plan is
evaluated on 1000 randomly chosen episodes
and the average success is calculated. This
entire process is repeated 10 times and the
average success rate over all 10 trials is
found. The curves in our graphs plot these
averages. For this set of experiments, statistical
significance is measured using the two-sample
t-test, with adjustments as required
whenever the F statistic indicates unequal
variances.

We add sensor noise, as defined in Section
4.1, for this second set of experiments
because GAs can learn robustly in the presence
of noise (Grefenstette et al., 1990).
Two  performance  measures  are  used:  the 
success  rate  and  the  convergence  rate.  The 
convergence  rate  is  defined  as  the  number  of 
GA  generations  required  to  achieve  and 
maintain  an  n%  success  rate,  where  n  is 
different  for  each  of  the  two  domains.  The 
value  of  n  is  set  empirically. 
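
The convergence rate defined above can be computed from a learning curve as in the following Python sketch. It assumes the curve is sampled every 5 generations, as in the experimental design, and that "maintain" means the success rate stays at or above the threshold through the end of the run; the function name and the sample curve are hypothetical.

```python
from typing import List, Optional


def convergence_generation(success: List[float], threshold: float,
                           step: int = 5) -> Optional[int]:
    """First generation from which the success rate stays >= threshold.

    success[i] is the success rate measured at generation i * step.
    Returns None if the threshold is never reached and held.
    """
    start: Optional[int] = None
    for i, rate in enumerate(success):
        if rate >= threshold:
            if start is None:
                start = i * step  # candidate convergence point
        else:
            start = None          # threshold lost, so start over
    return start


# Hypothetical curve: reaches 60% at generation 5, drops, then holds from 15.
curve = [0.30, 0.65, 0.55, 0.70, 0.80]
print(convergence_generation(curve, 0.60))
```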

4.3  Evaluation on the Evasion problem

Our simulation of the Evasion problem is partially
inspired by (Erikson and Zytkow,
1988). This problem consists of an agent,
which is controlled by SAMUEL, that moves
in a two-dimensional world with a single
adversary pursuing the agent. The agent’s
objective is to avoid contact with the adversary
for a bounded length of time. Contact
implies the agent is captured by the adversary.
The problem is divided into episodes
that begin with the adversary approaching the
agent from a random direction. The adversary
initially travels faster than the agent, but
is less maneuverable (i.e., it has a greater
turning radius). Both the agent and the
adversary gradually lose speed when
maneuvering, but only the adversary’s loss is
permanent. An episode ends when either the
adversary captures the agent (failure) or the
agent evades the adversary (success). At
the end of each episode, a critic provides full
payoff for successful evasion and partial
payoff otherwise, proportional to the amount
of time before the agent is captured. The
strengths of op rules that fired are updated in
proportion to the payoff.

The agent has the following operational sensors:
time, last agent turning rate, adversary
speed, adversary range, adversary bearing,
and adversary heading. The agent has one
operational action: it can control its own
turning rate. For further detail, see (Grefenstette
et al., 1990).

In  our  experiments,  we  provide  the  following 
domain-specific  knowledge: 

FACTS 

Chased_by(agent,  adversary). 

Moving(agent).  Moving(adversary). 

NONOP  RULES 

IF chased_by(X, Y) AND range(X, Y) = close
AND turn(X) = Z THEN turn(Y) = Z.

IF range(X, Y) = not(close) AND heading(Y,
X) = not(head_on) THEN avoids(X, Y).

IF turn(adversary) = hard THEN
decelerates(adversary).


ADVICE

IF speed(adversary) = high THEN ACHIEVE
decelerates(adversary).

IF  speed(adversary)  =  low  THEN  ACHIEVE 
avoids(agent,  adversary). 

We  also  include  knowledge  of  the  agent’s 
operational  sensors  and  actions  as  facts. 

Ten SKB nonop rules are used (they are
instantiations of the two rules described in
English in the introduction of this paper).
Although room does not permit listing them
all, some examples are:

IF bearing(X, Y) = right AND turn(X) = left
THEN heading(Y, X) = not(head_on).

IF bearing(X, Y) = left AND moving(X) AND
turn(X) = left THEN range(X, Y) = close.

From our input and our SKB rules, the compilation
method of Section 2 generates op
rules. The sensor values of these rules are
translated from qualitative values to default
quantitative ranges. For example, "bearing =
left" is translated into "bearing = [6..12]",
where the numbers correspond to a clock,
e.g., 6 means "6 o’clock". Every new rule is
given a strength of 1.0 (the maximum). The
final op rule set includes rules such as:*

IF speed(adversary) = [700..1000] AND
range(agent, adversary) = [0..750]
THEN turn(agent) = hard-left.

* To generate a few of these rules, we used a variant of our
compilation algorithm. We omitted a description of this variation
for the sake of clarity. See (Gordon and Subramanian, 1993) for
details.

The total number of op rules generated from
our advice is 16.

We begin our experiments by addressing the
first question, which concerns the
effectiveness of our advice-taking method.
We do not use the GA. Problem difficulty is
varied by adjusting a "safety" envelope
around the agent. The "safety" envelope is
the distance at which the adversary can be
from the agent before the agent is considered
captured by the adversary.

Figure 4 shows how the performance (averaged
over 1000 episodes) of these op rules
compares with that of just the random rule.
All of the differences between the means are
statistically significant (using significance
level α = 0.05). From Figure 4 we see that
from difficulty levels 80 to 120, the agent is
approximately twice as successful with
advice as without it. This is a 100% performance
advantage. Furthermore, for levels
120 to 160, the agent is about four times more
effective with advice. For levels 160 to 200,
the agent is an order of magnitude more
effective with advice. We conclude that as
the difficulty of this problem increases, the
advice becomes more helpful. These results
answer our first question: our advice compilation
method is effective on this domain.

We address the second question about multistrategy
effectiveness by combining the compiled
advice with GA refinement. Figure 5
shows the results of comparing the performance
of the GA with and without advice.
The "safety" envelope is fixed at 100 (chosen
arbitrarily) and noise is added to the sensors.
For this domain, the convergence rate is the
number of GA generations required to maintain
a 60% success rate.

Figure  5  shows  that  in  a  small  amount  of  time 
(less  than  10  generations),  the  GA  provides  a 
substantial  (50%)  improvement  in  success 
rate.  However,  the  convergence  rate  with 
and  without  advice  is  the  same.  Furthermore, 
although  prior  to  50  generations  the 
differences between the means are not statistically
significant (other than the initial boost
provided by the advice), after 50 generations
the improvement without advice over advice
is significant (α = 0.05). Therefore, this
domain  fails  to  demonstrate  the  superiority  of 
combining  strategies.  We  conjecture  that  our 
advice  is  not  properly  biasing  the  GA  into  the 
most  favorable  regions  of  the  search  space. 

Note that these results illuminate the tight
coupling that exists between the two strategies
in our multistrategy system. Consistently
high performance depends not only
on successful compilation of advice. It also
depends on how the advice initially biases the
GA. A ripe area for future research is to
experimentally determine effective initialization
methods.

FIG. 4. Evasion domain.

FIG. 5. Evasion domain with GA.

4.4  Evaluation on the Navigation problem

In the Navigation domain, our agent is again
controlled by SAMUEL in a two-dimensional
simulated world. The agent’s objective is to
avoid obstacles and navigate to a stationary
target with which it must rendezvous before
exhausting its fuel (implemented as a
bounded length of time for motion). Each
episode begins with the agent centered in
front of a randomly generated field of obstacles
with a specified density. An episode
ends with either a rendezvous at the target
location (success) or the exhaustion of the
agent’s fuel or a collision with an obstacle
(failure). At the end of an episode, a critic
provides full payoff if the agent reaches the
target, and partial payoff otherwise, depending
on the agent’s distance to the goal.

The agent has the following operational sensors:
time, the bearing of the target, the bearing
and range of an obstacle, and the range of
the target. The agent has two operational
actions: it can control its own turning rate
and its speed. For further detail, see Schultz
and Grefenstette (1992).

We  provide  the  following  domain-specific 
knowledge  (in  addition  to  a  list  of  operational 
sensors  and  actions): 

FACTS 

Moving(agent). 

NONOP RULES 

IF range(X, Y) = not(close) AND heading(Y,
X) = not(head_on) THEN avoids(X, Y).

ADVICE

IF  range(agent,  obstacle)  =  not(close)  THEN 
ACHIEVE  range(agent,  target)  =  close. 

IF  range(agent,  obstacle)  =  close  THEN 
ACHIEVE  avoids(agent,  obstacle). 

ACHIEVE  speed(agent)  =  high. 


The same 10 SKB nonop rules used for
Evasion are again used for this domain,
which confirms the portability of our SKB,
thus addressing the third question. A total of
42 op rules are generated.*

* We were able to decrease the number of op rules to 9 by
making one careful qualitative to quantitative mapping choice.

Again, we address the first question by using
SAMUEL without the GA. The success rate
is averaged over 1000 episodes. Without
advice, the average success rate is 0%
because this is a very difficult domain. Figure
6 shows how we improve the success rate to
as much as 90% by using our advice on this
domain. At all but the last few points, the
differences between the means are statistically
significant (α = 0.05). When we vary
the number of obstacles, performance follows
a different trend than for the Evasion domain.
By far the greatest benefit of the advice
occurs when there are few obstacles. Performance
is 10 times better when there is only
one obstacle, for example. The advantage
drops as the problem complexity increases.
After difficulty level 80, advice no longer
offers any benefit.

FIG. 6. Navigation domain.

Our experiments on both domains confirm
that our advice compiler can be effective;
however, they also indicate that the usefulness
of advice may be restricted to a particular
range of situations. Another learning task,
which we are currently exploring, would be
to identify this range and add further conditions
to the advice.

We address the second question by comparing
the performance of the GA with and
without advice. Noise is added. Figure 7
shows the results. All differences between
the means are statistically significant (α =
0.05). Here, the number of obstacles is fixed
at five (chosen arbitrarily). For this domain,
the convergence rate is the number of GA
generations required to maintain a 95% success
rate.

Figure  7  shows  that  the  addition  of  advice 
yields  an  enormous  performance  advantage 
on  this  domain.  Figure  7  also  shows  that 
given a moderate amount of time (10 generations),
the GA provides a 10% increase in the
success rate. Furthermore, the addition of
advice produces an 18-fold improvement in
the convergence rate over using GAs alone.
Not only does advice improve the convergence
rate, but it also improves the level of
convergence:  after  80  generations,  the  GA 
with  advice  holds  a  99%  or  above  success 
rate  whereas  after  all  100  generations  the  GA 
without  advice  still  cannot  get  above  a  97% 
success  rate.  For  this  problem,  the  advice 
appears  to  be  biasing  the  GA  into  a  very 
favorable  region  of  the  search  space. 

To further test our compilation method, we
have recompiled our Navigation advice into
op rules for a Nomad 200 mobile robot that is
equipped with very noisy sonar and infrared
sensors and can adjust its turning rate and
speed. The sensors are so noisy that the robot
sometimes mistakes two boxes four feet apart
for a wall. The op rules that result from compilation
have not been refined by the GA to
develop a tolerance to noise; therefore, this
noise poses a severe challenge.

FIG. 7. Navigation domain with GA.

The op rules are linked to a vendor-provided
interface that translates the language of the
SAMUEL rules (e.g., "IF asonar4 [17..85]
THEN SET turn -400") into joint velocity and
servo motor commands. From high-level
advice to avoid obstacles and rendezvous
with a goal point, our method has compiled
rules that enable the robot to succeed approximately
a third of the time in avoiding three
large boxes and reaching a goal point on the
other side of a room. The same SKB rules
are used for compilation. With the random
rule alone, the robot is extremely unlikely to
complete this task successfully. Our next step will
be to refine these robot rules using GAs
within a simulation.*

* We wish to use SAMUEL both to handle the noise and
because we had to manually refine the qualitative to quantitative
mappings somewhat - SAMUEL could automate this.

In conclusion, our multistrategy system offers
two advantages. First, it provides an initial
"boost" from seeding with initial high-level
knowledge. Although this boost is insubstantial
on Evasion, on Navigation we see an
order of magnitude improvement in the
convergence rate. Second, the multistrategy
system provides the robustness and improvement
gained from GA refinement.
Refinement yields a 10% increase in success
rate on Navigation and a 50% increase on
Evasion.

5  Related  Work 

This work relates most strongly to the following
topics in machine learning: advice taking,
combining projective and reactive planning,
methods for compiling high-level goals
into reactive rules, learning in fuzzy controllers,
and multistrategy learning. This work
also  relates  to  research  in  differential  game 
theory.  We  discuss  each  in  turn. 

5.1  Machine  learning 

Advice taking has been considered as early
as 1958 (McCarthy) and later by Mostow
(1983) and others. To date, research on
assimilating advice in embedded agents has
been limited but encouraging. Previous
research has focused mainly on providing
low-level knowledge. For example, Laird
et al. (1990) and Clouse and Utgoff (1992)
have had good success providing agents with
information about which action to take.
Chapman (1990) gives his agent high-level
advice. Our advice taker differs from
Chapman’s because it can operationalize
advice long before the advice is applied and
because it refines the advice with a GA.
Most important of all, our advice taking
method is unique because it involves a
multistrategy approach that couples a
knowledge-intensive deductive precompilation
phase with an empirical inductive
refinement phase.

We assume that high-level knowledge is
operationalized but not applied immediately.
Methods for operationalizing advice that will
be applied immediately include STRIPS-like
planners (Nilsson, 1980) and explanation-based
learning (EBL) planners (e.g., Segre,
1988). A closely related system is Mitchell’s
(1990). This system combines EBL projective
planning with reactive planning. Our
method for compiling goals is similar to that
of EBL because it uses the notion of
operationality. It differs because we do not assume
that the advice will be applied immediately,
and therefore our compilation method has no
current state on which to focus plan generation.
All of the above-mentioned methods
create a projective plan to achieve a goal
from the current state. We precompile advice
for multiple possible states.

Because  our  method  precompiles  plans  from 
possible  states  rather  than  from  a  current 
state,  it  is  very  similar  to  the  methods  of 
Schoppers  (1987)  and  Kaelbling  (1988)  for 
compiling  high-level  goals  into  low-level 
reactive  rules.  Our  method  differs  from  those 
of  Schoppers  and  Kaelbling  because  it 
includes  the  EBL  notion  of  operationality. 
Also  unlike  Schoppers  and  Kaelbling,  we  use 
a  refinement  method  following  compilation. 

Considerable  prior  work  has  focused  on 
knowledge  refinement.  Others  have  used 
GAs to refine qualitative to quantitative mappings.
For example, Karr (1991) uses GAs to
select fuzzy membership functions for a
fuzzy controller. Lin (1991), Mahadevan and
Connell (1991), and Singh (1991) initialize
their systems with modular agent architectures,
then refine them with reinforcement
learning.  Lin  trains  a  robot  by  giving  it 
advice  in  the  form  of  a  sequence  of  desired 
actions.  Mahadevan  and  Connell  initialize 
their  reinforcement  learner  with  a 
prespecified  subsumption  architecture,  and 
Singh  guides  his  reinforcement  learner  by 
giving  it  abstract  actions  to  decompose. 

One of the most similar approaches to ours is
that of Towell and Shavlik (1991). They also
couple rule-based input with a refinement
method; however, their refinement method
uses neural networks. This multistrategy system
converts rules into a network topology. The
content of each rule is preserved; therefore,
the transformation is syntactic. Our
multistrategy system, on the other hand, focuses
primarily on semantic transformations that
use qualitative knowledge about movements
in space to convert abstract goals into
concrete actions. The deductive compilation
scheme (but not the refinement) is in common
with Mitchell’s (1987) derivation of a
strategy for pushing objects in a tray using a
qualitative theory of the process.

5.2  Differential  game  theory 

Differential game theory is a branch of
mathematical optimal control theory. It
assumes that the behavior of the controlled
system can be modeled as a system of ordinary
differential equations (ODEs). The evasion
problem considered in this paper is a
typical example of a differential game. In
particular, the problem is a two-person
zero-sum differential game with a constant
terminal time. Both the pursuer and the evader
move in a bounded rectangle in two dimensions.
The evader has to avoid getting to
within a certain distance of the pursuer for a
certain length of time. In the minimax
formulation of the problem, the optimal strategy of
the evader is one that achieves its objective
under the least favorable assumptions on the
motion of the pursuer.

Differential games are formulated mathematically
by specifying the motion equations of
the pursuer and evader, the class of admissible
controls for both systems (which
identifies the way in which the pursuer and
evader can change their motions), and the
target or goal functional. A classic reference
for this is (Basar and Olsder, 1982). The
focus of work in differential game theory is to
identify conditions under which optimal
strategies for the evader can be derived. This
assumes complete knowledge of the dynamics
of the evader and pursuer, both of which
are unavailable to us. The theory would be
more useful to us if it had a qualitative
counterpart which allowed us to determine the
existence of solutions to the evader’s problem
from partial knowledge of the evader and
pursuer’s dynamics.

6  Discussion 

We  have  presented  a  novel  multistrategy 
learning  method  for  operationalizing  and 
refining  high-level  advice  into  low-level  rules 
to be used by a reactive agent. Operationalization
uses a portable SKB. An implementation
of this method has been tested on two
complex  domains  and  a  Nomad  200  robot. 

We  have  learned  the  following  lessons: 

(1) Our advice compiler can be effective on
complex domains, and it will be important to
identify the regions of greatest effectiveness
for advice.

(2) A portable SKB appears feasible.

(3) Coordinating a deductive learning
strategy (advice compilation) with an
inductive learning strategy (GA refinement)
can lead to a substantial performance
improvement over either method alone. This
success, however, depends on how the
advice biases the GA search.

Future work
will focus on identifying those characteristics
of advice that bias this search favorably. We
will also focus on further addressing our
questions about performance using different
advice and alternative domains (e.g.,
Subramanian and Hunter, 1993).

Many other interesting directions are suggested
by our experimental results. At
present we do not consider the cost of
incorporating advice. For larger scale problems
and situations where advice is provided
more frequently, the agent has to reason
about the costs and benefits of compiling
advice at a given point in time. Classical
issues in trading off deliberation time for
action time are relevant here. We have
chosen the GA method for refinement
because it was readily available to us. A
comparison of neural network refinement
schemes and reinforcement learning schemes
on the problems studied here will provide
valuable insights into the tradeoffs between
various refinement strategies. We believe that
multistrategy learning systems of the future
must have a bank of operationalization and
refinement methods at their disposal and have
fast methods for selecting them. We have
chosen a specific breakdown of effort
between the advice compilation and
refinement phases. How this coordinates with
our choice of problem domains and
refinement schemes is another question for
future study.

Abstract

Inducing concept descriptions from examples
requires a large space of hypotheses to be
explored. Genetic algorithms offer an
appealing alternative to traditional search
algorithms, because of their multi-point search
strategy. In this paper, the new system
REGAL is described: it uses genetic
algorithms to learn first order logic concept
descriptions. Moreover, it can be easily
integrated with a deductive component, in
order to exploit a domain theory. Two
approaches to learning disjunctive concept
descriptions are presented: the first one is a
modification of the classical method of
learning one disjunct at a time, whereas the
second one is based on the idea of fitness
sharing and tries to let subpopulations be
spontaneously formed, according to the theory
of niches and species. The approaches
have been compared on an artificial domain.

Key words: Learning relations, Genetic
algorithms

1.  Introduction 

Inducing concept descriptions from examples
and background knowledge is a fundamental
machine learning task which can be
formulated as a search problem (Mitchell,
1982; Michalski, 1983). What distinguishes one
approach from another is the concept
description language, the search method and
the hypothesis selection criterion.

Genetic algorithms offer a powerful,
domain-independent search method: they were
first used in machine learning in association with the
“classifier” model (Holland, 1986), but,
recently, they have also been used for concept
induction, both in propositional calculus (De
Jong & Spears, 1991; Vafaie & De Jong,
1991; Bala et al., 1991; Janikov, 1992) and in
first order logic (Giordana & Sale, 1992).
In these first experiments, genetic
algorithms proved to be an appealing
alternative to traditional search algorithms,
because of their great exploration power,
useful to escape local minima, and their
suitability to exploit massive parallelism.

This paper extends in many respects the
framework presented in (Giordana & Sale,
1992) in order to overcome some limitations
of that approach. In particular, the algorithm
GA-SMART, presented there, could only
learn concepts described in conjunctive
normal form, with no explicit negation. In the
present extension, multi-modal concepts,
described in a more powerful first order logic
language, can be learned; in this language
negation of single atoms and negation of
existentially quantified formulas may occur.
The resulting inductive algorithm can be
easily integrated into a deductive/inductive
paradigm such as the one described by
Bergadano and Giordana (1988).

2.  System  Overview 

The  system  REGAL,  described  in  this  paper, 
is  an  evolution  of  GA-SMART  (Giordana  & 
Sale,  1992)  and  adopts  the  same  method  of 
encoding  first  order  logic  (FOL)  formulas  into 
bit  strings.  GA-SMART  was  a  straightforward 
evolution  of  the  Simple  Genetic  Algorithm 
proposed  by  De  Jong  (1975). 

Two  alternative  approaches  are  proposed  in 
the  literature  in  order  to  encode  the  “problem 
solutions”  handled  by  genetic  algorithms.  The 
first  one  suggests  a  fixed-length  bit  string 
representation.  In  this  way,  the  genetic 
algorithm  architecture  becomes  problem 
independent  and  a  substantial  amount  of 
investigation,  available  in  the  literature 
(Goldberg,  1989a),  can  be  exploited  to  design 
the  genetic  operators.  The  drawback  is  that  it 
may  be  difficult  to  formulate  a  problem  in 
such  a  way  that  it  can  be  encoded  using  this 
representation.  The  second  approach  prefers 
an  encoding  scheme  more  closely  fitting  the 
original  formulation  of  the  problem.  The 
drawback  is  that  the  genetic  operators  must  be 
designed  ad  hoc.  For  concept  induction,  a 
variable-length  bit  string  representation  has 
been  used  in  (De  Jong  &  Spears,  1991)  and  a 
tree-like representation in (Janikov, 1991). In
both  cases,  special  genetic  operators  and 
control  strategies  have  been  designed. 

In  GA-SMART  and  REGAL  we  chose  the 
first  approach  and,  hence,  we  devoted  our 
efforts  to  the  proper  encoding  of  FOL 
formulas  into  fixed-length  bit  strings.  This  has 
been  achieved  by  restricting  the  hypothesis 


representation  language  L  in  such  a  way  that 
it  becomes  finite.  The  complexity  of  L  is 
defined  through  a  language  template  A, 
which  represents  the  maximally  complex 
formula  in  L.  Any  other  well  formed  formula 
is  obtained  by  deleting  some  literal  from  A. 

In  GA-SMART,  formulas  in  L  were  in 
conjunctive  normal  form  and  negation  was  not 
explicitly  allowed.  The  richer  language  used 
in  REGAL  allows  the  system  to  learn  the  same 
set  of  formulas  as  other  systems  do 
(Bergadano  et  al.,  1988,  1991;  Quinlan,  1990; 
Pazzani  &  Kibler,  1992).  Another  important 
advantage  of  REGAL  is  its  suitability  to  be 
integrated  with  a  deduction  system  and  used  to 
refine,  by  induction,  inconsistent  concept 
descriptions generated by EBG (Mitchell,
1986) with an incomplete and/or inconsistent
domain theory. Finally, a further improvement
introduced in REGAL is the use of a new kind
of sharing functions (Goldberg & Richardson,
1987), allowing the formation of
subpopulations. Crowding was already used in
GA-SMART  in  order  to  learn  many  concepts 
at  the  same  time.  Here,  the  technique  is 
modified  in  order  to  learn  concept  descriptions 
expressed  as  disjunctions  of  Horn  clauses. 

REGAL's learning strategy consists in
searching for a maximum value of the fitness
function $f(\varphi)$ associated with a formula $\varphi \in L$.
The algorithm starts with a set $A(0)$ of
formulas (individuals) randomly selected in $L$.
Each individual $\varphi_j \in A(0)$ is associated with the
corresponding fitness value $f(\varphi_j)$. In order to
explore new points in the search space, the
population $A(0)$ is left to evolve by applying
five operators: reproduction, mating,
crossover, mutation and seeding. Let $A(t)$ be
the solution population at time $t$; reproduction
consists in selecting a multiset $A'(t)$ of
individuals by sampling the elements of $A(t)$
with probability proportional to the
corresponding value of $f$. Then, in the mating
phase, the elements of $A'(t)$ are randomly
paired, and each pair undergoes crossover
with probability $p_c$, i.e., new offspring are
created by recombining in some way the
information encoded in the parents. The new
individuals partially or totally replace $A(t)$,
obtaining a new generation $A(t+1)$. Basically,
the algorithm searches for the highest values
of $f$ by statistically recombining tentative
solutions already discovered. As this strategy
could cause a loss of essential information, the
seeding operator (in addition to classical
mutation) is exploited to introduce new
information in the evolution process.
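
The evolution cycle described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not REGAL's implementation: the one-point crossover, the per-bit mutation, the toy fitness `count_ones`, and the parameter names `p_c` and `p_m` are all introduced here for illustration, and the seeding operator is omitted because this section does not define it.

```python
import random

def reproduce(population, fitness, rng):
    """Sample a multiset A'(t) with probability proportional to fitness."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=len(population))

def crossover(a, b, rng):
    """One-point crossover on two equal-length bit lists (assumed operator)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in ind]

def next_generation(population, fitness, p_c, p_m, rng):
    """One step A(t) -> A(t+1): reproduction, mating, crossover, mutation."""
    selected = reproduce(population, fitness, rng)
    rng.shuffle(selected)  # mating: random pairing of the selected individuals
    offspring = []
    for i in range(0, len(selected) - 1, 2):
        a, b = selected[i], selected[i + 1]
        if rng.random() < p_c:
            a, b = crossover(a, b, rng)
        offspring += [mutate(a, p_m, rng), mutate(b, p_m, rng)]
    if len(selected) % 2:  # odd population: carry the unpaired individual over
        offspring.append(mutate(selected[-1], p_m, rng))
    return offspring

def count_ones(ind):
    """Toy fitness (hypothetical): number of 1 bits, plus 1 to keep weights positive."""
    return sum(ind) + 1

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
for _ in range(30):
    population = next_generation(population, count_ones, p_c=0.6, p_m=0.01, rng=rng)
best = max(population, key=count_ones)
```

Here the new individuals totally replace the old population; a partial replacement, which the text also allows, would keep some members of the previous generation.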

As  mentioned  in  Section  1,  the  basic  scheme 
described  above  has  been  made  a  little  more 
sophisticated,  by  incorporating  the 
mechanisms  of  crowding  (De  Jong,  1975)  and 
of  sharing  functions  (Goldberg  &  Richardson, 
1987)  in  order  to  allow  the  formation  of 
subpopulations.  This  turns  out  to  be  useful 
when  many  concepts  are  to  be  learned  at  the 
same  time  or  when  concepts  have  a 
multimodal  structure,  which  requires 
disjunctive descriptions to be learned.

REGAL  can  work  serially  as  well  as  in 
parallel  on  a  multi-processor  according  to  the 
network  model  described  by  Goldberg 
(1989b). A discussion of the parallel
implementation on a 16-transputer network
can be found in (Giordana & Sale, 1992). The
results presented in this paper have been
obtained using a single-processor Sun 10
workstation.


3.  Knowledge  Representation  and 
Encoding 

The  language  L,  used  by  REGAL,  is  a  first 
order  logic  language  containing  conjunction, 
disjunction,  negation,  internal  disjunction 
(Michalski,  1983)  and  existential  and 
universal  quantification.  The  building  blocks 
of  the  language  are  “internally  disjunct” 
atoms,  such  as,  for  instance: 

$$\mathrm{colour}(x, \mathrm{yellow} \lor \mathrm{green}) \tag{3.1}$$

in which one of the arguments of a predicate is
a disjunction of constants. Formula (3.1) is
semantically equivalent to:

$$\mathrm{colour}(x, \mathrm{yellow}) \lor \mathrm{colour}(x, \mathrm{green}) \tag{3.2}$$

but  is  more  compact  and  readable. 

More generally, a predicate $P$ of arity $m$ is
specified using the following syntax:

$$P(x_1, x_2, \ldots, x_m, [v_1, v_2, \ldots, v_n]) \tag{3.3}$$

where the complex term $[v_1, v_2, \ldots, v_n]$
denotes the maximal internal disjunction, i.e.,
the set of all the possible values the feature $P$
can assume on the tuple of variables
$\langle x_1, x_2, \ldots, x_m \rangle$. Any other disjunction inside $P$ is
represented by a subset of the set $[v_1, v_2, \ldots, v_n]$.
If the set $[v_1, v_2, \ldots, v_n]$ does not exhaust
the possible values of $P$, this set is completed
by means of the symbol "$*$". Then, in the
formula $P(x_1, x_2, \ldots, x_m, [v_1, v_2, \ldots, *])$ the
symbol "$*$" denotes "none of the values
$v_1, v_2, \ldots, v_n$". In other words, "$*$" is an
abbreviation for the expression
$\neg v_1 \land \neg v_2 \land \ldots \land \neg v_n$.
Therefore, the symbol "$*$" allows a
restricted form of negation to be implemented.
As an example, let us consider the predicate
$\mathrm{colour}(x, [\mathrm{yellow}, \mathrm{green}, \mathrm{blue}, \mathrm{grey}, *])$. Then,
$\mathrm{colour}(x, [\mathrm{yellow}, *])$ is equivalent to
$\mathrm{colour}(x, \mathrm{yellow} \lor [\neg\mathrm{yellow} \land \neg\mathrm{green} \land \neg\mathrm{blue} \land \neg\mathrm{grey}])$
or, equivalently, to
$\neg\mathrm{colour}(x, \mathrm{green}) \land \neg\mathrm{colour}(x, \mathrm{blue}) \land \neg\mathrm{colour}(x, \mathrm{grey})$.
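
The meaning of an internal disjunction containing "*" can be illustrated with a small sketch. The function name `satisfies`, the constant `STAR`, and the extra colour "red" (standing for any value not named in the maximal disjunction) are assumptions made here for illustration only.

```python
STAR = "*"  # stands for "none of the values named in the maximal disjunction"

def satisfies(value, disjunction, maximal):
    """True if `value` satisfies the internal disjunction.

    `maximal` lists the named values of the maximal internal disjunction
    (without "*"); `disjunction` is a subset of them, possibly with "*".
    """
    if value in disjunction:
        return True
    # "*" matches any value that is not one of the named ones.
    return STAR in disjunction and value not in maximal

maximal = ["yellow", "green", "blue", "grey"]
# Evaluate colour(x, [yellow, *]) on the named colours and on an unnamed one.
results = {c: satisfies(c, ["yellow", STAR], maximal)
           for c in ["yellow", "green", "blue", "grey", "red"]}
```

The predicate holds exactly for the colours that are neither green, nor blue, nor grey, which agrees with the equivalence stated in the text.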

Deleting a term from an internal disjunction is
a specialization operation. For instance¹:

$$P(x, [v_1]) \mathrel{|<} P(x, [v_1, *]) \mathrel{|<} P(x, [v_1, v_2, *])$$

In particular, a predicate is said to be in maximally
specific form (msf) when its internal
disjunction contains only one term, and in
maximally general form (mgf) when its
internal disjunction contains all the possible
values, including "*". We notice that, owing
to the completeness hypothesis, a predicate
with an empty internal disjunction has to be
considered illegal and one in mgf is
tautologically true.

A  second  form  of  negation,  greatly  increasing 
the  power  of  the  language  L,  is  the  negation 
of  existentially  quantified  formulas.  This  form 
of  negation,  widely  used  in  logic 
programming,  can  be  learned  by  systems  such 
as  ML-SMART  (Bergadano  et  al.  1988, 
1991),  FOIL  (Quinlan,  1990, 1991)  and  FOCL 
(Pazzani  &  Kibler,  1992).  A  negated 
existentially quantified formula has, in L, the
following syntax:

$$\neg\exists\, y_1, \ldots, y_m\, [\psi(x_1, \ldots, x_n, y_1, \ldots, y_m)] \tag{3.4}$$

where $\psi$ is a disjunction of (possibly internally
disjunct) predicates, each one containing at
least one variable in the set $y_1, y_2, \ldots, y_m$. The
genetic operators can deal with negated
formulas in a similar way as they do with
positive ones.


¹ The relation "$\psi$ is more general than $\varphi$" is
denoted by $\varphi \mathrel{|<} \psi$, according to Michalski
(1983).


Deleting a term from the internal disjunction
of a predicate occurring in $\psi$ is a
generalization operation, as it appears from
the equivalence of formulas (3.5) and (3.6):

$$\neg\exists x\, [P(x, [a \lor b]) \lor Q(x, [e \lor f])] \tag{3.5}$$

$$\forall x\, [\neg P(x, a) \land \neg P(x, b) \land \neg Q(x, e) \land \neg Q(x, f)] \tag{3.6}$$

Finally, the occurrence inside $\psi$ of a predicate
in mgf leads to an absurdum, owing to the
completeness hypothesis, whereas a predicate
with an empty internal disjunction leads to a
tautology.

The  knowledge  base  acquired  by  REGAL 
consists  of  a  flat  set  of  concept  descriptions  in 
disjunctive  normal  form: 

$$\varphi_1 \lor \varphi_2 \lor \ldots \lor \varphi_n \rightarrow h \tag{3.7}$$

where $h$ denotes a concept and the $\varphi_i$'s ($1 \le i \le n$)
are conjunctions of predicates, possibly
containing, in turn, internal disjunctions. In
the following we will describe the language
template and the method for mapping each
conjunction $\varphi_i$ to a fixed-length bit string.

3.1.  The  Language  Template 

As mentioned in Section 2, the language L is
characterized by a language template A, which
is a conjunctive formula such that every other
conjunctive formula $\varphi_i$ of L can be obtained
by deleting some literal from A². In REGAL
the template A obeys the following syntax:

$$A \equiv \varphi(x_1, \ldots, x_n) \land \neg\exists\, y_1, \ldots, y_m\, [\psi(x_1, \ldots, x_n, y_1, \ldots, y_m)] \tag{3.8}$$

In (3.8) both $\varphi$ and $\psi$ denote conjunctions of
predicates; moreover, each predicate in $\psi$
must contain at least one variable in the set
$y_1, y_2, \ldots, y_m$, as in (3.4). The template A is, then,
partitioned into two subformulas, $A^{+}$ and $A^{-}$,
corresponding to the positive and negated
parts of A, respectively. In Fig. 3.1 an
example of a language template is reported,
for the sake of illustration.

² In this paper "literal" is used to refer to a single
constant in an internal disjunction, because the
latter can be transformed into a disjunction of
literals, as shown by (3.1) and (3.2).

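$$\begin{aligned}
A \equiv{} & \mathrm{colour}(x, [\mathrm{red}, \mathrm{blue}, *]) \land \mathrm{shape}(x, [\mathrm{square}, \mathrm{triangle}, *]) \land {} \\
& \neg\exists y\, [\mathrm{colour}(y, [\mathrm{red}, \mathrm{blue}, *]) \land \mathrm{far}(x, y, [0, 1, 2, 3, *])] \\
A^{+} \equiv{} & \mathrm{colour}(x, [\mathrm{red}, \mathrm{blue}, *]) \land \mathrm{shape}(x, [\mathrm{square}, \mathrm{triangle}, *]) \\
A^{-} \equiv{} & \neg\exists y\, [\mathrm{colour}(y, [\mathrm{red}, \mathrm{blue}, *]) \land \mathrm{far}(x, y, [0, 1, 2, 3, *])]
\end{aligned}$$

Fig. 3.1 - Example of a language
template including the unary predicates
“colour” and “shape” and the binary
predicate “far”. The template describes a
scene in which no object y, of any color,
is at any distance from an object x of
any color and shape.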

Any predicate occurring in $\varphi$ or $\psi$ is in mgf.
Each $\varphi_i$, occurring in any concept description
(3.7), is a particular instantiation of A. For
example, the formula:

$$\mathrm{colour}(x, [\mathrm{red}]) \land \mathrm{shape}(x, [\mathrm{square}]) \land \neg\exists y\, [\mathrm{colour}(y, [\mathrm{blue}]) \land \mathrm{far}(x, y, [2, 3])] \tag{3.9}$$

is  an  instantiation  of  the  template  reported  in 
Fig.  3.1.  Formula  (3.9)  describes  a  scene  in 
which  no  blue  object  has  a  distance  value  of  2 
or  3  from  a  red  square. 
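
To make the reading of formula (3.9) concrete, the following sketch checks it against two small scenes. The `Obj` record, the one-dimensional `pos` attribute, and the definition of `far` as an absolute difference of positions are assumptions introduced here; the paper does not say how the feature "far" is computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    name: str
    colour: str
    shape: str
    pos: int  # position on a line; a stand-in for real scene geometry

def far(x, y):
    """Hypothetical distance feature: absolute difference of positions."""
    return abs(x.pos - y.pos)

def formula_3_9(x, scene):
    """colour(x,[red]) and shape(x,[square]) and
    not exists y [colour(y,[blue]) and far(x,y,[2,3])]."""
    positive = x.colour in {"red"} and x.shape in {"square"}
    negated = any(y.colour in {"blue"} and far(x, y) in {2, 3} for y in scene)
    return positive and not negated

scene_a = [Obj("a", "red", "square", 0), Obj("b", "blue", "triangle", 5)]
scene_b = [Obj("a", "red", "square", 0), Obj("b", "blue", "triangle", 2)]
ok_a = formula_3_9(scene_a[0], scene_a)  # the blue object is at distance 5
ok_b = formula_3_9(scene_b[0], scene_b)  # the blue object is at distance 2
```

The red square satisfies the formula in the first scene but not in the second, where a blue object lies at one of the excluded distances.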

As discussed before, each predicate in $A^{+}$ is
tautologically true, and, thus, $A^{+}$ is also
tautologically true. On the contrary, $A^{-}$ is
tautologically false and, hence, A is
tautologically false. As a consequence, a
predicate  in  mgf  can  be  deleted  from  the 
positive  part  of  a  formula  without  changing  its 
extension,  whereas  a  predicate  with  an  empty 
internal  disjunction  can  be  deleted  from  the 
negated  part  of  a  formula  without  changing  its 
extension. 

In the language L there may also exist
predicates  which  do  not  have  internal 
disjunction  (for  instance,  equal(x,y)).  These 
predicates  are  suggested  and  added  to  the 
language  template  by  the  background 
knowledge,  but  are  not  considered  during  the 
inductive  learning  process,  because  they 
represent  necessary  constraints  that  must 
always  be  present  in  every  concept  description 
and  are  not  processed  by  the  generalisation 
mechanism  of  REGAL. 

3.2. Mapping Formulas to Bit Strings

The  language  template  can  be  represented 
using  a  bit  string  s(A),  where  each  literal 
(term)  occurring  in  the  maximal  internal 
disjunction  of  a  predicate  in  A  is  associated  to 
a  corresponding  bit  in  s(A).  Predicates  in  A 
which  do  not  have  internal  disjunction 
(necessary  constraints)  do  not  need  to  be 
associated  to  any  bit  in  the  string. 

By keeping adjacent in s(A) the bits
corresponding to literals that are adjacent in
the template, s(A) will be partitioned into two
parts $s(A^{+})$ and $s(A^{-})$, corresponding to $A^{+}$
and $A^{-}$, respectively. The bit string associated
to the template of Fig. 3.1 is reported in Fig.
3.2. Any other formula in L, obtained by
deleting some literal from A, can be
represented by the bit string s(A), in which the
values of the bits have been properly set.

Fig. 3.2 - Bit string associated to the language template reported in Fig. 3.1


This  last  point  needs  a  separate  discussion  for 
the  positive  and  negative  parts  of  the  template. 

The  semantic  interpretation  of  the  alleles  in 
the  bit  string  has  been  defined  on  the  basis  of 
the  previous  considerations.  In  particular,  for 
the  positive  part  of  a  formula,  if  the  bit 
corresponding  to  a  given  term  v  in  a  predicate 
P  is  set  to  1,  then  v  belongs  to  the  current 
internal  disjunction  of  P,  whereas  if  it  is  set  to 
0, it does not belong to it. Hence, a substring
containing all 0's for a predicate is illegal and
it is automatically rewritten as a string of all
1's. On the contrary, for the negated part of a
formula, the semantic interpretation is the
opposite: setting to 1 a bit corresponding to a
given term v means that v is absent from the
corresponding internal disjunction, whereas
setting it to 0 means that it is present. Again, a
substring containing all 0's for a predicate is
illegal, whereas a substring containing all 1's
corresponds to the maximal generality for
a conjunct. Then, the system again replaces a
string of all 0's with a string containing all 1's,
whenever it occurs.
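
The interpretation of the bits can be sketched as a small decoder for the template of Fig. 3.1. The `TEMPLATE` table, the function `decode`, and the left-to-right bit ordering are assumptions made for this illustration; the actual layout used by REGAL is the one shown in its figures.

```python
TEMPLATE = [
    # (predicate, maximal internal disjunction, belongs to the negated part?)
    ("colour(x)", ["red", "blue", "*"], False),
    ("shape(x)", ["square", "triangle", "*"], False),
    ("colour(y)", ["red", "blue", "*"], True),
    ("far(x,y)", ["0", "1", "2", "3", "*"], True),
]

def decode(bits):
    """Map a bit string onto the internal disjunction of each predicate."""
    formula, start = [], 0
    for name, terms, negated in TEMPLATE:
        chunk = bits[start:start + len(terms)]
        start += len(terms)
        if "1" not in chunk:            # all 0's is illegal in both parts
            chunk = "1" * len(terms)    # and is rewritten as all 1's
        keep = "0" if negated else "1"  # negated part: 0 means "present"
        present = [t for t, b in zip(terms, chunk) if b == keep]
        formula.append((name, present, negated))
    return formula

# One possible encoding of formula (3.9) under the assumed bit ordering.
decoded = decode("100" "100" "101" "11001")
# A string with illegal all-zero substrings for both colour predicates.
repaired = decode("000" "100" "000" "11001")
```

After the repair, an all-zero substring in the positive part yields a predicate in mgf, while in the negated part it yields an empty internal disjunction; both are the tautological cases discussed in Section 3.1.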

In  Fig.  3.3  the  bit  string  corresponding  to  the 
formula  of  Fig.  3.2  is  reported. 

3.3. Integrating REGAL in a Deductive
Learning Framework

Several authors have suggested using an inductive
procedure to refine concept descriptions
obtained by EBG with imperfect domain
theories. In (Bergadano & Giordana, 1988),
for instance, tentative concept descriptions
are generated from non-operational ones, using
a possibly incomplete and/or inconsistent
domain theory and a set of learning events.
The deductive engine, which performs EBG with
many examples at the same time, is similar to
those used in deductive databases. In this way,
those concept descriptions which need to be
refined because they are inconsistent can be
detected immediately. The inductive
procedure is similar to the one that was used
later in FOIL (Quinlan, 1990).

$$\mathrm{colour}(x, [\mathrm{red}]) \land \mathrm{shape}(x, [\mathrm{square}]) \land \neg\exists y\, [\mathrm{colour}(y, [\mathrm{blue}]) \land \mathrm{far}(x, y, [2, 3])]$$

Fig. 3.3 - Bit string corresponding to the formula reported in Fig. 3.2.

An  improvement  of  the  method  described  in 
(Bergadano  &  Giordana,  1988)  was  proposed 
later  by  the  same  authors  (Bergadano  et  al., 
1989).  The  inductive  refinement  of  a  concept 
description  can  be  made  more  effective  by 
telling  the  system  which  parts  of  the  domain 
theory  are  incomplete.  This  is  done  by 
extending  the  classical  Horn  clause  language 
with  a  special  construct  called  “predicate  set”, 
which  is  represented  using  the  following 
syntax: 

$$\{P_1, P_2, \ldots, P_n\} \rightarrow Q \tag{3.10}$$

where $P_1, P_2, \ldots, P_n$ and $Q$ denote predicates
with their variables. Expression (3.10) means
that a definition for $Q$ is unknown and that it
must be found by induction using the
description language defined by the predicates
occurring in the set $\{P_1, P_2, \ldots, P_n\}$. Predicate
sets  are  recognized  by  the  deductive  engine 
and  are  left  in  proofs  as  unresolved  literals. 
Then,  the  inductive  procedure  removes  the 
possible  inconsistencies  in  the  proofs  by 
performing  a  search  in  the  space  defined  by 
the  predicate  sets  occurring  there;  as  a  last 
step,  predicate  sets  are  eliminated  from  the 
concept  descriptions. 

In  this  paper,  we  propose  to  use  the  internal 
disjunction  formalism  to  describe  predicate 
sets. The advantage of this choice is that the
deductive engine of ML-SMART (Bergadano
et al., 1988, 1991) can be used to generate the
template A from a domain theory. Afterwards,
REGAL  can  be  invoked  as  an  inductive 
procedure  to  refine  concept  descriptions. 

4.  The  Fitness  Function 

The fitness function gives an evaluation of
how well a formula $\varphi$ describes a concept $h$.
Three main criteria are usually adopted to
evaluate the quality of a concept description:
consistency, completeness and simplicity
(Michalski, 1983). The same criteria are
adopted here and are combined in the fitness
function. In the following, a family of
empirical fitness functions, new with respect
to the one used in GA-SMART, is presented
and analysed.

Let $F$ be the set of learning examples, and let
$\varphi$ be a candidate description of the concept $h$;
let, moreover, $M^{+}(h)$ and $M^{-}(h)$ denote the
numbers of positive and negative instances of
$h$ in $F$, respectively. Finally, let $m^{+}(\varphi)$ and
$m^{-}(\varphi)$ be the numbers of positive and negative
instances of $h$, respectively, belonging to $F$
and verifying $\varphi$.

As a measure of completeness the ratio
$x = m^{+}(\varphi)/M^{+}(h)$ is used, whereas the consistency
is evaluated by $w = m^{+}(\varphi)/[m^{+}(\varphi) + m^{-}(\varphi)]$,
i.e., as the ratio between the number of
positive instances and the global number of
instances verifying $\varphi$ (Bergadano et al., 1988,
1991). However, $w$ tends to give a too
optimistic evaluation, especially when the
number of available negative instances is
small in comparison to that of positive ones.
Suppose, for instance, that $M^{+}(h) = 750$ and
$M^{-}(h) = 250$; a formula covering all positive
instances and all negative instances will have
$w = 0.75$, even if it is totally useless.

Fig. 4.1 - (a) Plot of $(1 - w)$ versus $m^{+}(\varphi)$ and $m^{-}(\varphi)$. (b) Plot of $m^{-}(\varphi)/M^{-}(h)$ versus
$m^{+}(\varphi)$ and $m^{-}(\varphi)$. The values $M^{+}(h) = 750$ and $M^{-}(h) = 250$ have been used.

Therefore,  we  looked  for  a  measure  more 
severely  penalising  inconsistency.  The 
currently  adopted  measure  is: 

$$y = \max\left\{\, 1 - w,\ \frac{m^-(\varphi)}{M^-(h)} \,\right\} \qquad (4.1)$$

In Fig. 4.1(a) and 4.1(b) the plots of $(1 - w)$ and
$m^-(\varphi)/M^-(h)$ versus $m^+(\varphi)$ and $m^-(\varphi)$ are
reported. The plot of y versus $m^+(\varphi)$ and $m^-(\varphi)$ is
reported in Fig. 4.2.

Finally, the simplicity of a formula is equated to its syntactic
generality and is measured by $z = n(1)/n(s)$, where $n(1)$ is the number
of 1s in a string s(A), and $n(s)$ is the total number of bits in s(A).
This may seem a simplistic evaluation, but it proved to work well in
several test cases.
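As an illustration only, the completeness, consistency and simplicity
measures defined above can be computed as in the following Python sketch.
The function names, and the convention w = 0 for a formula that covers no
instance at all, are assumptions made for this example and are not taken
from REGAL.

```python
def completeness(m_pos, M_pos):
    """x = m+(phi) / M+(h): fraction of the positive instances covered."""
    return m_pos / M_pos


def raw_consistency(m_pos, m_neg):
    """w = m+(phi) / [m+(phi) + m-(phi)].

    Assumption: w is taken to be 0 when the formula covers nothing,
    since the ratio is undefined in that case.
    """
    covered = m_pos + m_neg
    return m_pos / covered if covered else 0.0


def inconsistency(m_pos, m_neg, M_neg):
    """y = max(1 - w, m-(phi) / M-(h)), as in equation (4.1)."""
    return max(1.0 - raw_consistency(m_pos, m_neg), m_neg / M_neg)


def simplicity(bits):
    """z = n(1) / n(s): fraction of 1s in the bit string."""
    return bits.count("1") / len(bits)
```

For the useless formula discussed above, which covers all 750 positive
and all 250 negative instances, raw_consistency(750, 250) gives w = 0.75,
whereas inconsistency(750, 250, 250) gives y = 1, the worst possible
value.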


The three measures introduced above are combined into a single function
$f(\varphi)$, defined as follows:

$$f(\varphi) = x^{\alpha}\,(1 - y^{\beta}) + A\,(e^{Bz} - 1) + D \qquad (4.2)$$

where $A, D \ll 1$ and $B, \alpha, \beta < 1$ are user-defined
parameters. The resulting surface representing $f(\varphi)$ in a domain
with $M^-(h) = 250$ and $M^+(h) = 750$ is reported in Fig. 4.3. We notice
that the exact analytic form of the function f is not fundamental; what
is important is the qualitative shape of the corresponding surface. The
proposed function is the result of a series of trials; its computation
time is negligible with respect to the matching time required to evaluate
$m^+(\varphi)$ and $m^-(\varphi)$. The small value D has been added for
the following reason: when the population is randomly initialized, it is
possible that x is zero, thus making $f(\varphi)$ zero, which prevents
$\varphi$ from being selected for reproduction. A value of $f(\varphi)$
different from zero, even if small, gives $\varphi$ a chance of being
selected by the reproduction operator.

Fig. 4.2 - Plot of the y function, defined by (4.1), versus
$m^+(\varphi)$ and $m^-(\varphi)$, with $M^+(h) = 750$ and
$M^-(h) = 250$.

Fig. 4.3 - Plot of the fitness function $f(\varphi)$, defined by (4.2),
versus $m^+(\varphi)$ and $m^-(\varphi)$, with $M^+(h) = 750$ and
$M^-(h) = 250$. The values used for the parameters are:
$\alpha = \beta = 0.2$, A = 0.002, B = 0.4 and D = 0.000001.

5. Genetic Operators


The fundamental genetic operators used by REGAL are the classical ones
used in the literature, with the addition of two non-standard crossovers
and a new form of mutation called seeding, which will be described later.

Crossover operators are inherited from GA-SMART; in particular, they are
the two-point crossover and the uniform crossover, previously used in the
literature (De Jong, 1975; Syswerda, 1989), and the generalising and
specialising crossovers, specifically designed for the task at hand. As
described by Goldberg (1989a), the two-point crossover creates two new
offspring by exchanging two corresponding substrings randomly selected in
the parents. In the uniform crossover, information is exchanged between
the two parents in such a way as to give every bit position the same
chance of being swapped. The parent strings $s_1$ and $s_2$ are scanned
from left to right, and each pair of corresponding bits is exchanged with
probability p = 0.5. The two crossovers have been selected empirically
after experimenting with several data sets.
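The two classical operators just described can be sketched in Python as
follows. This is a minimal illustration that represents individuals as
strings of '0' and '1' characters; the function names and the explicit
random generator argument are assumptions of the example, not part of
REGAL.

```python
import random


def two_point_crossover(s1, s2, rng):
    """Exchange the substring lying between two random cut points."""
    i, j = sorted(rng.sample(range(len(s1) + 1), 2))
    child1 = s1[:i] + s2[i:j] + s1[j:]
    child2 = s2[:i] + s1[i:j] + s2[j:]
    return child1, child2


def uniform_crossover(s1, s2, rng, p=0.5):
    """Scan both parents left to right, swapping each bit pair with probability p."""
    child1, child2 = [], []
    for a, b in zip(s1, s2):
        if rng.random() < p:
            a, b = b, a
        child1.append(a)
        child2.append(b)
    return "".join(child1), "".join(child2)
```

In both cases every bit position of the two offspring holds the same pair
of values as the parents, possibly swapped, so no bit value is created or
lost by the operator.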

The generalising and specialising crossovers need additional explanation.
As described in Fig. 3.2, the string s(A) is divided into substrings,
each one corresponding to a specific predicate $P_i$. In both crossover
types, a set D of predicates is randomly selected in A. The specialising
crossover works as follows: (a) The substrings in $s_1$ and $s_2$
corresponding to the predicates not selected in D are copied unchanged
into the corresponding offspring $s'_1$ and $s'_2$. (b) For each
predicate $P_i \in D$, a new substring $s'_i$ is generated by AND-ing the
corresponding bits of the substrings $s_1(P_i)$ and $s_2(P_i)$. The
substring $s'_i$ is then copied into both $s'_1$ and $s'_2$. The
generalising crossover differs from the specialising one in that the new
substring $s'_i$ is obtained by OR-ing the corresponding bits of $s_1$
and $s_2$. The set D is created by assigning to each predicate $P_i$ an
equal probability of being selected.
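The following Python sketch illustrates the specialising and generalising
crossovers under simple assumptions: an individual is a list of bit
substrings, one per predicate, and the name p_select for the probability
of putting a predicate into D is hypothetical.

```python
import random


def _combine(a, b, use_and):
    """Bitwise AND (use_and=True) or OR (use_and=False) of two bit substrings."""
    out = []
    for x, y in zip(a, b):
        if use_and:
            bit = x == "1" and y == "1"
        else:
            bit = x == "1" or y == "1"
        out.append("1" if bit else "0")
    return "".join(out)


def logical_crossover(s1, s2, p_select, rng, specialise):
    """Specialising (AND) or generalising (OR) crossover.

    s1 and s2 are lists of substrings, one per predicate. Each predicate
    enters the set D with probability p_select; substrings of predicates
    outside D are copied unchanged into the offspring.
    """
    child1, child2 = list(s1), list(s2)
    for i, (a, b) in enumerate(zip(s1, s2)):
        if rng.random() < p_select:  # predicate i has been selected in D
            child1[i] = child2[i] = _combine(a, b, specialise)
    return child1, child2
```

Since AND-ing can only turn bits off and OR-ing can only turn bits on,
the first variant yields offspring that are at least as specific as both
parents on the selected predicates, and the second yields offspring that
are at least as general.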

Given a pair of strings $\langle s_1, s_2 \rangle$ generated by the
mating procedure, crossover is applied with an assigned probability
$p_c$. Then, the specific crossover type is selected stochastically,
taking into account the features of $s_1$ and $s_2$. The probabilities
$p_u$ of uniform crossover, $p_{2pt}$ of two-point crossover, $p_s$ of
specialising crossover and $p_g$ of generalising crossover are assigned
through the following set of functions:

$$
\begin{aligned}
p_u     &= (1 - a f)\, b \\
p_{2pt} &= (1 - a f)\,(1 - b) \\
p_s     &= a f\, r \\
p_g     &= a f\,(1 - r)
\end{aligned}
\qquad (5.1)
$$

In equations (5.1), a and b are tunable parameters,
$f = [f(s_1) + f(s_2)]/2$ is the mean fitness of the two strings $s_1$
and $s_2$, and
$r = [m^+(s_1) + m^-(s_1) + m^+(s_2) + m^-(s_2)] / [2\,(M^+ + M^-)]$
is the mean, over the formulas $\varphi_1$ and $\varphi_2$ associated
with the bit strings $s_1$ and $s_2$, of the ratio between the number of
instances covered and the total number of instances in the training set.
When the formulas cover no instance, r = 0 and specialisation is never
applied; when they cover all the instances, r = 1 and generalisation is
never applied.
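A small Python sketch of equations (5.1) is given below. The function and
argument names are illustrative assumptions; covered1 and covered2 stand
for the total numbers of instances (positive plus negative) covered by
the two parents, and the product a·f is assumed to lie in [0, 1] so that
the four values form a probability distribution.

```python
import random


def crossover_type_probabilities(f1, f2, covered1, covered2,
                                 total_instances, a, b):
    """Return the four crossover-type probabilities of equations (5.1)."""
    f = (f1 + f2) / 2.0                                  # mean fitness of the pair
    r = (covered1 + covered2) / (2.0 * total_instances)  # mean coverage ratio
    return {
        "uniform": (1.0 - a * f) * b,
        "two_point": (1.0 - a * f) * (1.0 - b),
        "specialising": a * f * r,
        "generalising": a * f * (1.0 - r),
    }


def choose_crossover_type(probabilities, rng):
    """Draw one crossover type according to the given probabilities."""
    kinds = list(probabilities)
    weights = [probabilities[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]
```

By construction the four values sum to one: the first two share the mass
1 - a·f and the last two share the mass a·f, so fitter pairs are more
likely to undergo a generalising or specialising crossover.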

The seeding operator is primarily used to initialise the population, in
order to start with a set of formulas covering at least some examples in
F. In fact, the concept description language characterised by a template
can be so large that a randomly generated population may contain no
individual matching any element of F. In this case, the fitness is close
to zero for all individuals and the search reduces to a random walk for a
long initial phase, until formulas covering a few examples are discovered
by chance. The seeding operator takes as input a string s (corresponding
to a formula $\varphi$) and returns a modified string s', which covers at
least one instance randomly selected from F. Therefore, the action
performed by this operator resembles the selection of the seed in the
Star methodology (Michalski, 1983; Gemello et al., 1991). Finally, the
seeding operator can also be used by REGAL, as an alternative to
classical mutation, in order to reintroduce genetic information lost
during the evolution. This point will be discussed further in the next
section.
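The text does not say how s' is built from s. The Python sketch below
shows one possible reading in a deliberately simplified, purely
attribute-based setting (REGAL itself works on first-order formulas): an
individual is a list of bit substrings, one per attribute, an instance is
a list of value indices, and seeding switches on exactly the bits needed
to cover a randomly drawn instance. All names are hypothetical.

```python
import random


def covers(s, instance):
    """True if, for every attribute, the bit of the instance's value is set."""
    return all(sub[v] == "1" for sub, v in zip(s, instance))


def seed(s, examples, rng):
    """Return (s_prime, instance): s minimally generalised to cover one
    instance drawn at random from the examples.

    Assumption: 'covering' is obtained by switching on, in each substring,
    the bit corresponding to the instance's value for that attribute.
    """
    instance = rng.choice(examples)
    s_prime = [sub[:v] + "1" + sub[v + 1:] for sub, v in zip(s, instance)]
    return s_prime, instance
```

Applied to every member of a random initial population, such an operator
guarantees that each individual matches at least one example, which is
the property the initialisation step needs.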

6.  Learning  Multimodal  Concepts 

The fundamental novelty of REGAL is its ability to learn disjunctive
concepts. Learning disjunctive concepts is an inherently deceptive
problem for a genetic algorithm. In fact, each separate disjunct in the
concept description corresponds to a local maximum of the fitness
function. Therefore, the genetic algorithm will frequently try to apply
crossover between individuals representative of different disjuncts, thus
creating offspring that are necessarily worse than the parents (because a
consistent common generalization does not exist).

Two strategies have been tried in order to deal with this problem. The
first one is an adaptation of a commonly used technique, which learns one
disjunct at a time (Michalski, 1980; Bergadano et al., 1988; Quinlan,
1990). The second one, based on the theory of niches and species (Deb &
Goldberg, 1989), tries to learn a set of complete and consistent
disjunctive descriptions by encouraging the formation of subpopulations.
Both techniques will be discussed and compared using an artificial
domain, where a difficult multimodal concept has been constructed.

6.1.  Learning  one  Disjunct  at  a  Time 

The test application has been designed by extending the well-known train
set used by Michalski (1980). In the present case too, we have two
concepts to distinguish: trains going East and trains going West. Each
learning event is represented as a sequence of items (coaches), each one
described by a vector of attributes referring to shape, colour, position,
length, number of wheels and number of loads. Thousands of trains have
been generated by a program which selects the values of the attributes at
random. Then, each train has been classified using a set of disjunctive
rules. The challenge for REGAL was to discover the original rules or a
set of equivalent ones.

The  rules  for  classifying  trains  going  East  (the 
first  concept)  are  reported  in  the  following: 

Class  1  -  Trains  going  East 
Rule  1:  In  second  position  there  is  an  open-top 
small  coach,  carrying  one  load,  followed  by 
an  open-top  small  coach. 

Rule 2: In third position there is a closed-top small coach carrying one
load, and in fifth position there is an open-top small coach.

Rule  3:  In  position  two,  three  or  four  there  is  a 
small  coach,  with  two  wheels  and  carrying 
one  load,  immediately  followed  by  a  long 
white  coach  carrying  one  load. 


The learning set used for the experiment contained 500 instances of
Class 1 and 500 instances of Class 2. Rule 1 covered 98 instances of
Class 1, Rule 2 covered 206 and Rule 3 covered 209. The three subsets
overlapped slightly, because 13 instances satisfied more than one rule.

The concept description language used by REGAL is very similar to the one
described in (Michalski, 1980) and is reported in Table 1.

Using the strategy of learning one disjunct at a time, REGAL was able to
solve the problem. In particular, it learned a complete and consistent
description of Class 1, consisting of three disjuncts, covering 216, 200
and 90 positive examples and roughly corresponding to Rules 3, 2 and 1,
respectively. For example, the description of the largest disjunct (216
examples) is the following:

coach(x) ∧ coach(y) ∧ follows(x,y) ∧
length(x, [1]) ∧ Nload(x, [1]) ∧
length(y, [2]) ∧ Nload(y, [1,3]) ∧
colour(y, [white])

These results have been obtained using a population of 800 individuals
initialized by the seeding operator. The genetic evolution has been
controlled using a linear fitness scaling mechanism (Goldberg, 1989a) and
a generation gap of 35%, i.e., only about one third of the individuals
were replaced at each generation.

Table 1

Predicates and template characterizing the concept description language
used by REGAL for learning the concept of trains going East

Predicates in the bit string                      Comment
Position(x, [0, 1, 2, 3, 4])                      The position number of the coach,
                                                  starting from the engine (position 0)
Length(x, [1, 2])                                 Length of the coach (Short or Long)
Wheels(x, [2, 3])                                 Number of wheels
Nloads(x, [0, 1, 2, *])                           Number of loads
Colour(x, [yl, wh, rd, gr, gy, bk])               yl = yellow, wh = white, etc.
Shape(x, [ot, en, us, or, cr, el, jt, he, sl])    ot = open-top, en = engine, etc.
Distant(x, y, [0, 1, *])                          Number of coaches between x and y

Constraint predicates                             Comment
Follow(x, y)                                      Item y follows item x
Coach(x)                                          Item x is a coach

Template

Coach(x) ∧ Position(x, [0, 1, 2, 3, 4]) ∧ Length(x, [1, 2]) ∧ Wheels(x, [2, 3]) ∧
Nloads(x, [0, 1, 2, *]) ∧ Colour(x, [yl, wh, rd, gr, gy, bk]) ∧
Shape(x, [ot, en, us, or, cr, el, jt, he, sl]) ∧
Coach(y) ∧ Position(y, [0, 1, 2, 3, 4]) ∧ Length(y, [1, 2]) ∧ Wheels(y, [2, 3]) ∧
Nloads(y, [0, 1, 2, *]) ∧ Colour(y, [yl, wh, rd, gr, gy, bk]) ∧
Shape(y, [ot, en, us, or, cr, el, jt, he, sl]) ∧
Position(z, [0, 1, 2, 3, 4]) ∧ Length(z, [1, 2]) ∧ Wheels(z, [2, 3]) ∧
Nloads(z, [0, 1, 2, *]) ∧ Colour(z, [yl, wh, rd, gr, gy, bk]) ∧
Shape(z, [ot, en, us, or, cr, el, jt, he, sl]) ∧
Follow(x, y) ∧ Distant(x, y, [0, 1, *]) ∧ Distant(y, z, [0, 1, *]) ∧ Distant(x, z, [0, 1, *])

The system was run repeatedly, in order to find a single partial
definition each time, until all the training instances of the target
concept were covered. In particular, at each run, the system was left
free to converge to some concept definition $\varphi$, which, of course,
covered only a subset $\varphi^*$ of the positive instances of Class 1.
Then, the instances in $\varphi^*$ were removed from F and the system was
restarted. Learning one disjunct at a time is probably the easiest way to
cope with the deceptiveness of the problem. In fact, as deception is due
to the simultaneous presence of competing disjuncts, removing them as
soon as they are discovered makes the task of learning the remaining ones
easier.
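The run-remove-restart procedure described above is a covering loop and
can be sketched in Python as follows. The callable learn_disjunct is a
hypothetical stand-in for one complete run of the genetic search; the
early exit when a run covers nothing new is an added safeguard, not
something stated in the text.

```python
def learn_one_disjunct_at_a_time(positives, negatives, learn_disjunct):
    """Covering loop: learn a disjunct, remove what it covers, restart.

    learn_disjunct(remaining_positives, negatives) must return a pair
    (formula, covered_positives). Returns the list of learned disjuncts
    and the set of positive instances left uncovered (normally empty).
    """
    remaining = set(positives)
    disjuncts = []
    while remaining:
        formula, covered = learn_disjunct(remaining, negatives)
        covered = set(covered) & remaining
        if not covered:
            break  # safeguard: the last run made no progress
        disjuncts.append(formula)
        remaining -= covered
    return disjuncts, remaining
```

Each pass sees only the positive instances that are still unexplained, so
the disjuncts found earlier no longer compete with those that remain to
be discovered.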

6.2.  Learning  Many  Disjuncts  at  One  Time 

Two approaches are proposed in the literature to ease the formation of
subpopulations: crowding (De Jong, 1975) and the use of sharing functions
(Goldberg & Richardson, 1987). Crowding is a variant of the basic
algorithm with respect to reproduction and replacement of older
individuals. In particular, the new individuals, generated by crossover
and mutation, replace the older ones that are most similar to them,
according to a given similarity measure. In this way, subpopulations are
likely to grow because genetic pressure tends to manifest itself
primarily among similar individuals. Both in GA-SMART and in REGAL this
method proved to work well for learning many concepts at one time, but it
was unable to allow the stable formation of subpopulations representative
of disjunctive definitions of the same concept. In all the experiments
performed, in the long term one disjunct prevailed over the others. The
interpretation we give, in terms of the deceptiveness of the problem, is
that stronger disjuncts inhibit the reproduction of the other ones by
means of unfruitful matings.
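The replacement rule of crowding can be sketched in Python as follows.
This is a simplified illustration: similarity is measured by Hamming
distance on bit strings and each newcomer is compared with the whole
population, whereas De Jong's original scheme compares it only with a
small random sample whose size is set by a crowding factor. The function
names are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def crowding_replace(population, newcomers):
    """Let each newcomer replace the individual most similar to it."""
    pop = list(population)
    for child in newcomers:
        closest = min(range(len(pop)), key=lambda i: hamming(pop[i], child))
        pop[closest] = child
    return pop
```

For example, with the population ["0000", "1111", "0011"] the newcomer
"1110" replaces "1111", the only member at Hamming distance 1 from it.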

The method based on sharing functions, unlike crowding, acts on the
reproduction probability in order to inhibit the excessive growth of the
genetic pressure of a subpopulation. This is done by reducing the fitness
of an individual, depending on the number of existing individuals similar
to it. In the initial formulation, proposed by Goldberg & Richardson
(1987), genotypic sharing was considered. The fitness value $f(\varphi)$
associated with an individual $\varphi$ was considered as a reward from
the environment to be shared with other individuals, in proportion to
their degree of similarity with $\varphi$. Similarity between two
individuals $\varphi$ and $\varphi'$ was evaluated through the Hamming
distance d(s,s') between the corresponding bit strings s and s'.
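For comparison with the mechanism adopted in REGAL, genotypic sharing can
be sketched in Python as below. The sketch uses the form in which sharing
is commonly presented in the genetic algorithm literature (a triangular
sharing function with niche radius sigma and exponent alpha, and fitness
divided by the niche count); these parameters and names do not appear in
the present paper and are assumptions of the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sharing(distance, sigma, alpha=1.0):
    """Sharing function: 1 at distance 0, falling to 0 at distance sigma."""
    if distance < sigma:
        return 1.0 - (distance / sigma) ** alpha
    return 0.0


def genotypic_shared_fitness(population, fitness, sigma, alpha=1.0):
    """Divide each fitness by the niche count of the individual."""
    shared = []
    for s, f in zip(population, fitness):
        niche_count = sum(sharing(hamming(s, t), sigma, alpha)
                          for t in population)
        shared.append(f / niche_count)  # niche_count >= 1 (self is counted)
    return shared
```

Two identical strings halve each other's fitness, while an individual
with no neighbour within distance sigma keeps its fitness unchanged.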

However, in many cases it is better to consider a semantic distance,
i.e., the phenotypic distance, rather than the syntactic one, as
discussed in (Deb & Goldberg, 1989). For instance, in our problem it is
easy to find formulas that are apparently similar but have very different
extensions on the learning set F. Therefore, we tried to design a proper
mechanism for sharing fitness, since the one described in (Goldberg &
Richardson, 1987) is not suitable for our task. The philosophy underlying
the sharing function approach is that subpopulations (species) live by
exploiting environmental niches. If a species proliferates too much, it
will be limited by the implicit reduction of the per-capita income from
the niche it exploits.

In REGAL, learning events are considered as life sources that are
exploited by the formulas covering them. A formula $\varphi$ matching
$m^+(\varphi)$ positive events takes its support from them in order to
evaluate its fitness. However, if the same events are also matched by
other formulas, the fitness of $\varphi$ is consequently reduced, because
$\varphi$ is not essential to cover such events. In this way, the
reproduction rate decreases when formulas become too redundant.

The algorithm used to evaluate the fitness shared by competing formulas
can be easily understood using the following metaphor: concept instances
are cakes and formulas are living beings eating cakes:

1) After crossover, mutation and seeding, formulas are evaluated using
the function described in Section 4 for computing their absolute fitness.

2) Formulas, sorted according to their absolute fitness, are allowed to
eat cakes. Each one takes one serving from each one of the cakes
associated with the positive instances it covers. If all the servings of
a cake have already been eaten, it will not have any.

3) For each formula $\varphi$ the shared fitness $f_{sh}(\varphi)$ is
evaluated according to the following expression:

$$f_{sh}(\varphi) = f(\varphi)\,\frac{E}{m^+(\varphi)} \qquad (6.1)$$


where E represents the total number of servings eaten by $\varphi$ and
$m^+(\varphi)$ the number of positive instances covered by $\varphi$.

If an individual cannot eat at all, it will have a shared fitness equal
to zero and will therefore not reproduce.

Fig. 6.1 - Results obtained using the fitness scaling method with a
population of 800 individuals, using a crossover probability $p_c$ = 0.5
and a generation gap of 35%. (Histogram over disjuncts 1 to 21. Legend:
cardinality of the single disjunction; positive instances globally
covered; negative instances globally covered.)
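The cake-eating procedure of steps 1) to 3) and equation (6.1) can be
sketched in Python as follows. The number of servings available in each
cake is not specified in the text, so servings_per_cake is a hypothetical
parameter; the sketch also assumes that formulas eat in order of
decreasing absolute fitness.

```python
def regal_shared_fitness(formulas, servings_per_cake):
    """Shared fitness according to the cake metaphor.

    formulas: list of (name, absolute_fitness, covered_positive_ids).
    Returns a dict mapping each name to f_sh = f * E / m+, where E is the
    number of servings the formula managed to eat.
    """
    servings_left = {}
    shared = {}
    # Assumption: the fittest formulas are served first.
    for name, f, covered in sorted(formulas, key=lambda t: t[1], reverse=True):
        eaten = 0
        for cake in covered:
            if servings_left.setdefault(cake, servings_per_cake) > 0:
                servings_left[cake] -= 1
                eaten += 1
        shared[name] = f * eaten / len(covered) if covered else 0.0
    return shared
```

With one serving per cake and formulas a, b and c of absolute fitness
0.9, 0.8 and 0.5 covering the instance sets {1, 2, 3}, {2, 3} and {3, 4},
formula a keeps its full fitness, b finds nothing left to eat and gets
zero, and c eats only cake 4 and keeps half of its fitness.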

Experimentation with the test case described above showed that the method
was able to control the reproduction rate so as to allow the formation of
stable subpopulations. The results obtained by running REGAL for 200
generations with a global population of 800 individuals are reported in
Fig. 6.1. Black columns in the histogram represent the numbers of
positive instances covered by each one of the first 21 disjuncts, sorted
as follows: first, the consistent disjuncts (1-16) are sorted according
to decreasing completeness and, then, the inconsistent disjuncts (17-21)
are sorted according to increasing inconsistency. Dashed (dotted) columns
report the global number of positive (negative) instances covered by all
the disjuncts from the first up to the current one.

It is worth noting that the first six disjuncts cover 450 of the 500
positive instances of Class 1. They are in a static equilibrium, being
positioned on F in such a way that each one exploits a good number of
instances without competitors. Such disjuncts had been present in the
population since the 60th generation. On the contrary, the 12 disjuncts
from 9 to 21 were in a kind of dynamic equilibrium, being in hard
competition for survival. They sometimes disappeared, only to reappear
later, when regenerated by the genetic evolution. In particular, the
presence of the small disjuncts from 10 to 16 is due to their continuous
creation by the seeding operator.

We notice that several large disjuncts are present, some of them
corresponding approximately to the definitions given by Rules 2 and 3;
others were discovered by performing alternative kinds of generalization.
For instance, the first disjunct corresponds to the following definition,
obtainable by generalising Rule 3:

coach(x) ∧ length(x, [1]) ∧ Nload(y, [1,3]) ∧
colour(y, [wh]) ∧ follow(x,y)    (6.2)

Formula (6.2) is consistent with the data and covers 216 positive
instances instead of 206. This generalization has been made possible
because the learning set was not large enough to be representative of all
the possible cases. If the number of instances of the second class were
increased, this overgeneralisation would disappear.

On the other hand, we notice the lack of a single disjunct corresponding
to the 98 instances classified by Rule 1. We explain this fact as an
effect of the deceptive action of the large disjuncts, which interact
negatively with the growth of smaller alternative concept definitions.

7.  Conclusions 

In this paper we presented an extension of a method described in
(Giordana & Sale, 1992) for learning concept descriptions in first order
logic, using an inductive engine based on a genetic algorithm. Several
substantial improvements have been introduced. First, the concept
description language has been extended in order to include internal
disjunction and negation. Second, we have shown how this genetic learning
paradigm can be integrated with a deductive module in the very same way
as in (Bergadano & Giordana, 1988).

Two techniques have been introduced to learn disjunctive concepts. The
first one learns one disjunct at a time, whereas the second one allows
subpopulations to be formed. Even if the work in this direction is still
at an early stage, we have presented a sharing mechanism which proved
effective in allowing stable subpopulations, corresponding to different
disjuncts, to be formed.

References 

Bala  J.,  De  Jong  K.A.  and  Pachowicz  P., 
“Learning  Noise  Tolerant  Classification 
Procedures  by  Integrating  Inductive  Learning 
and  Genetic  Algorithms”,  Proc.  First 
International  Workshop  on  Multistrategy 
Learning  (Harpers  Ferry,  WV),  pp.  316-323, 
1991. 

Bergadano  F.  and  Giordana  A.,  “A 
Knowledge  Intensive  Approach  to  Concept 
Induction”,  Proc.  5th  Machine  Learning 
Conference  (Ann  Arbor,  MI),  pp.  305-317, 
1988. 

Bergadano  F.,  Giordana  A.  and  Ponsero  S., 
“Deduction  in  Top-Down  Inductive 
Learning”,  Proc.  6th  Machine  Learning 
Workshop  (Ithaca,  NY),  pp.  23-25, 1989. 

Bergadano F., Giordana A. and Saitta L., “Automated Concept Acquisition
in Noisy Environments”, IEEE Trans. on Pattern Analysis and Machine
Intelligence, PAMI-10, 555-578, 1988.

Bergadano  F.,  Giordana  A.  and  Saitta  L., 
Machine  Learning:  An  Integrated  Approach 
and  its  Application,  Ellis  Horwood, 
Chichester,  UK,  1991. 

Deb K. and Goldberg D., “An Investigation of Niche and Species Formation
in Genetic Function Optimization”, Proc. 3rd Int. Conf. on Genetic
Algorithms, Fairfax, VA, pp. 42-50, 1989.

De Jong K.A., “Analysis of the Behaviour of a Class of Genetic Adaptive
Systems”, Doctoral Dissertation, Department of Computer and
Communication Sciences, University of Michigan, Ann Arbor, MI, 1975.

De Jong K.A. and Spears W.M., “Learning Concept Classification Rules
Using Genetic Algorithms”, Proc. IJCAI-91, Sydney, Australia,
pp. 651-656, 1991.

Gemello R., Mana F. and Saitta L., “RIGEL: An Inductive Learning
System”, Machine Learning, 6, 7-36, 1991.

Giordana  A.  and  Sale  C.,  “Genetic  Algorithms 
for  Learning  Relations”,  Proc.  9th  Int.  Conf. 
on  Machine  Learning,  Aberdeen,  Scotland, 
pp.  169-178,  1992. 

Goldberg  D.E.  Genetic  Algorithms,  Addison- 
Wesley,  1989a. 

Goldberg  D.E.,  “Sizing  Populations  for  Serial 
and  Parallel  Genetic  Algorithms”,  Proc.  3rd 
Int.  Conf.  on  Genetic  Algorithms,  Fairfax,  VA, 
pp.  70-79, 1989b. 

Goldberg  D.E.  and  Richardson  J.,  “Genetic 
Algorithms  with  Sharing  for  Multimodal 
Function  Optimization”,  Proc.  2nd  Int.  Conf. 
on  Genetic  Algorithms,  Cambridge,  MA,  pp. 
41-49,  1987. 

Janikow C.Z., “A New System for Inductive Learning in Attribute-Based
Spaces”, Lecture Notes in Artificial Intelligence, 542, 378-388, 1991.

Holland  J.H.,  “Escaping  Brittleness:  The 
Possibilities  of  General  Purpose  Learning 
Algorithms  Applied  to  Parallel  Rule-Based 
Systems”.  In  R.  Michalski,  J.  Carbonell  &  T. 
Mitchell  (Eds.),  Machine  Learning:  An  AI 
Approach,  Vol.  II.  Morgan  Kaufmann,  Los 
Altos,  CA,  pp.  593-623, 1986. 

Michalski R., “Pattern Recognition as Rule-Guided Inductive Inference”,
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2,
349-361, 1980.

Michalski R., “A Theory and Methodology of Inductive Learning”. In
R. Michalski, J. Carbonell & T. Mitchell (Eds.), Machine Learning: An AI
Approach, Vol. I. Morgan Kaufmann, Los Altos, CA, pp. 83-134, 1983.

Mitchell T., “Generalization as Search”, Artificial Intelligence, 18,
203-226, 1982.

Mitchell T., Keller R.M., Kedar-Cabelli S., “Explanation-Based
Generalization: A Unifying View”, Machine Learning, 1, 47-80, 1986.

Pazzani M. and Kibler D., “The Utility of Knowledge in Inductive
Learning”, Machine Learning, 9, 57-94, 1992.

Quinlan J.R., “Learning Logical Definitions from Relations”, Machine
Learning, 5, 239-266, 1990.

Vafaie H. and De Jong K.A., “Improving the Performance of Rule Induction
System Using Genetic Algorithms”, Proceedings First International
Workshop on Multistrategy Learning, Harpers Ferry, WV, pp. 305-315, 1991.




Incremental  Genetic  Programming  and  Neural  Net  Learning: 

A  Case  Study 


Hugo de Garis

Brain Builder Group,
Evolutionary Systems Department,
ATR Human Information Processing Research Laboratories,
2-2 Hikari-dai, Seika-cho, Soraku-gun,
Kansai Science City,
Kyoto, 619-02, Japan.

tel: + 81 7749 5 1079, fax: + 81 7749 5 1408,
email: degaris@hip.atr.co.jp


Abstract 

This paper investigates whether an incremental approach to Genetic
Programming (i.e. using Genetic Algorithms to build/evolve complex
systems) (de Garis 1990, 1993) is possible. The vehicle used to explore
this question is that of a simple mapping of binary input vectors to
binary output vectors. Since this mapping is learned using Neural
Networks, Genetic Algorithms, and Incrementalism, Incremental Genetic
Programming (or Incremental Evolution) can be considered a form of
Multistrategy Learning.

Keywords  :  Multistrategy  Learning, 
Incremental  Evolution,  Genetic  Programming 
(GP),  Incremental  GP,  Genetic  Algorithms 
(GAs),  GenNets  (Genetically  Programmed 
Neural  Network  Modules),  Artificial  Nervous 
Systems, Biots (Biological Robots), Darwinian Robotics, 1000-GenNet
Biots,
GenNet  Accelerators,  GenNet  Shaping, 
Cellular  Automata  (CAs),  CA  Networks, 
Neurite  Networks,  CA  Neurons,  CA 
Machines,  Darwin  Machines,  CREEPER. 


1.  Introduction 

This paper takes a first step in the direction of what the author calls
"Incremental Evolution", and situates it in the context of Multistrategy
Learning. A growing number of people around the world (especially those
working in the field of Artificial Life) are now realizing that the rise
of ULSI (ultra large scale integrated) circuits and future molecular
scale technologies will probably necessitate an evolutionary approach to
complex system building, rather than the traditional approach of human
design (blueprinting). The evolutionary building of complex systems has
been labeled "Genetic Programming" by the author (de Garis 1990, 1993).
Within our lifetimes, there will be so many components in systems that
they will not be designable, because these systems will have become too
complex (too many components, too many complex nonlinear interactions).
Artificial brains and artificial embryos are such examples. Such systems
will have to self-assemble and be evolved to overcome the complexity
barrier. (The beauty of GP is that the internal complexity of the system
which is successfully evolved is irrelevant. This is because the Genetic
Algorithm, which is the underlying tool of GP, does not care about the
complexity of the systems it evolves, so long as the fitness values of
the evolving systems keep increasing.)

In the Brain Builder Group at ATR, we hope to build/evolve artificial
brains using Genetic Programming techniques in special hardware known as
Darwin Machines. Thinking about how to do this led the author directly to
the problem discussed in this paper, namely "Is it possible to do Genetic
Programming incrementally?" This question will probably become very
important in the next few years, because if one evolves a given system
S1, which has a given level of complexity and functionality (e.g. a robot
"kitten" with an artificial brain, giving it 100 "behaviors"), and one
then wishes at some later time to evolve a more sophisticated system S2
(e.g. a robot "cat" with 1000 "behaviors"), does one just throw away the
first system S1 and all the man-years of work that went into it, or is it
possible to evolve S2 using S1 as a base, i.e. can one evolve S2
incrementally from S1 (as happens in nature)? The author believes that in
the 1990s, there will be strong economic pressure to solve the
"incremental GP problem".

Having stressed the importance of the question
(i.e. can one GP incrementally?), this paper
makes an initial attempt at providing an
answer, by taking a multistrategy learning
approach to the problem of mapping binary
input vectors to binary output vectors. More
specifically, it shows how a multistrategy
learning approach was applied to the task of
generating an associative memory in an
incremental fashion. The particular task chosen
is merely an illustration. The emphasis of this
paper is on "incremental evolution", and not
upon neural networks, nor associative
memory, nor neural network learning
algorithms which add neurons to the net
incrementally. "Incremental evolution" is a
new topic, and has little to do with the already
substantial literature on what is conventionally
called "incremental learning" (i.e. the
incremental generation of classes, given
one-at-a-time presentation of elements to be
classified), despite the similarity of the two
topic labels. The simple example chosen here
to illustrate the possibility or otherwise of
incremental GP was to generate a 1:1 mapping
between binary input and binary output
vectors, where the input/output vector pairs
were not all supplied at once. Three strategies
were combined to perform this learning task.
The first strategy employed was to use a neural
network as the vehicle to generate the
association. The second strategy was to train
the neural network using an evolutionary
approach (i.e. using a Genetic Algorithm). The
third strategy was to perform this evolution in
an incremental manner. Thus this paper
introduces the "Incremental Evolution
Problem" and shows how a multistrategy
learning approach can help solve it. More
concretely, a Genetic Algorithm was employed
to evolve the weight values of a fully connected
neural network (called a "GenNet" (de Garis
1990, 1993)) which initially contained N
neurons to perform T tasks. The results (i.e.
the evolved weights of the N neurons) were
then taken, and to this neural network were
added a few more neurons dN, to evolve the
performance of a few more tasks dT. This
paper investigates

(a)  whether  this  can  be  done  at  all  (the  most 
important  question  in  view  of  the  above 
discussion  on  incremental  evolution,  i.e. 
incremental  GP), 

(b) and if so, as a possible bonus, whether it
might be faster to evolve an N+dN
GenNet performing T+dT tasks
incrementally (i.e. [N,T], then [N+dN,
T+dT]), than to evolve it from scratch,

(c) how the two approaches (i.e. from scratch
or incremental) compare in task
performance quality.

The author believes that the concept of
Incremental Evolution will become increasingly
important as more research groups attempt to
build artificial nervous systems (ANS) using
evolved neural networks as modules. This type
of work is now going on in at least four labs
(as far as the author is aware) around the
world, i.e. the author's "Brain Builder Group"
at ATR, Beer's group at Case Western Reserve
University, Arbib's group at the University of
Southern California, and Cliff et al's group at
Sussex University. Sooner or later, all of these
groups will have to face the question of
Incremental Evolution, i.e. "Whether it is better
(i.e. easier, quicker) to scrap an earlier, simple
artificial nervous system (ANS), by
building/evolving a newer, bigger ANS from
scratch, OR, whether it is better to add
components to the already existing ANS, i.e.
to build/evolve incrementally?" Nature was
forced to build incrementally because it did not
have the luxury to scrap an earlier design. Each
step in nature's evolutionary path had to be
from one viable design to another
(incrementally modified) but equally viable
design.

2. The Experiments

In an attempt to answer the three questions
above (i.e. (a), (b), (c)), a GenNet of 12 (fully
connected) neurons was evolved (using a
Genetic Algorithm) which mapped 4-bit input
vectors to 4-bit output vectors. The four input
neurons were distinct from the four output
neurons. During all of these experiments, the
input/output ([I]/[O]) pairs that were used were
taken from the following 5 :- ([1010]/[0110],
[1110]/[0010], [0011]/[0100], [1100]/[1001]
and [1101]/[0011]). Real input or output
values less than 0.2 were arbitrarily interpreted
to be a binary "0", and input or output values
greater than 0.4 were arbitrarily interpreted to
be a binary "1". To measure the fitness of an
evolving GenNet, the following approach was
used. For a given binary input vector, e.g.
[1010], its desired or target output was [0110],
according to the above list. The input vector
[1010] and desired output vector were
converted into their respective input and output
signal values, to become [0.5, 0.1, 0.5, 0.1]
and [0.0, 0.7, 0.7, 0.0]. (The 0.5 and 0.7
values were chosen because experience
showed they facilitated GenNet dynamics.)
The four input vector component values were
interpreted to be the (clamped) external input
values of the four input neurons of the GenNet
being evolved. If the outputs at the 100th cycle
(giving plenty of time for transients to die out
and the outputs to stabilize) were [0.23, 0.35,
0.56, -0.18], then the fitness of this output
was defined to be :-

$$\text{fitness} = 20{,}000 - 1000\,(0.23 - 0.0)^2 - 1000\,(0.7 - 0.35)^2 + (0.56 - 0.7)^2 + (0.0 - (-0.18))^2$$


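More formally, where vtarg is the desired or
target output value, vout is the actual output
value, and dF is the fitness contribution :-

$$dF = \begin{cases}
+\,(0.0 - v_{out})^2 & \text{if } v_{targ} = 0.0 \text{ and } v_{out} < 0.0 \\
-1000\,(v_{out} - 0.0)^2 & \text{if } v_{targ} = 0.0 \text{ and } v_{out} > 0.0 \\
-1000\,(0.7 - v_{out})^2 & \text{if } v_{targ} = 0.7 \text{ and } v_{out} < 0.7 \\
+\,(v_{out} - 0.7)^2 & \text{if } v_{targ} = 0.7 \text{ and } v_{out} > 0.7
\end{cases}$$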

These fitness contributions were defined so
that vout values lying outside the "0" and "1"
regions were heavily penalized, and so that
vout values lying inside these regions were
rewarded mildly, so as to push the vout values
towards +1.0 and -1.0 (corresponding to
binary values "1" and "0" respectively). If P
input vectors were presented to the GenNet,
the total fitness was defined to be

$$\text{fitness} = \frac{1}{P} \sum_{j=1}^{P} \Bigl( 20000 + \sum_{i=1}^{4} dF_{ij} \Bigr)$$

where i ranged over the four components of a
single output vector, and j ranged over the P
input vectors. A brief description of the
evolution of a GenNet is now given, for those
not already familiar with it.
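As an aside on the fitness measure just defined, the following is a minimal Python sketch of the piecewise contribution dF and of the averaged total fitness. It is an illustration, not the author's code: the function names are hypothetical, and the behavior when vout exactly equals the target (a contribution of zero) is an assumption, since the text only specifies the strict inequalities.

```python
def fitness_contribution(v_targ, v_out):
    """dF for one output component, following the piecewise rule in the text."""
    if v_targ == 0.0:
        if v_out < 0.0:
            return (0.0 - v_out) ** 2            # mild reward inside the "0" region
        return -1000.0 * (v_out - 0.0) ** 2      # heavy penalty (zero at equality: assumption)
    if v_targ == 0.7:
        if v_out < 0.7:
            return -1000.0 * (0.7 - v_out) ** 2  # heavy penalty below the "1" target
        return (v_out - 0.7) ** 2                # mild reward inside the "1" region
    raise ValueError("target must be 0.0 or 0.7")


def total_fitness(targets, outputs):
    """Average, over the P presented vectors, of 20000 plus the summed dF values."""
    p = len(targets)
    total = 0.0
    for targ_vec, out_vec in zip(targets, outputs):
        total += 20000.0 + sum(
            fitness_contribution(t, o) for t, o in zip(targ_vec, out_vec)
        )
    return total / p
```

Here `targets` and `outputs` are lists of P vectors of signal values; a network that reproduces every target exactly scores 20000.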

This  description  has  appeared  already  in  many 
publications  (e.g.  de  Garis  1990,  1993).  A 
GenNet  is  a  Genetically  Programmed  Neural 
Network.  Genetic  Programming  is  defined  to 
be  the  use  of  Genetic  Algorithms  (GAs)  as 
builders  of  complex  systems.  A  GA  was  used 
to evolve the weight values and signs of neural
connections so that the outputs of the GenNet
took  desired  values  (or  in  some  experiments, 
controlled  some  system  in  desired  ways).  A 
GenNet  is  a  fully  connected  network. 

For a connection, and hence weight (wij)
between neuron "i" and neuron "j", one bit is
used for the weight's sign (0 for an excitatory
synapse, 1 for an inhibitory synapse), and 6 to
8 bits are used for the weight (assumed to have
an absolute value less than 1.0, and which is
expressed as a binary fraction). Thus if the
number of bits "B" used to express the weight
is 6, then the binary string 1110100 is
equivalent to the weight -0.8125. Hence for a
GenNet of N neurons, the number of bits in
the chromosome which specifies all the
weights and signs of the connections between
the N neurons, will be N*N*(B+1). Each sign
and weight is concatenated onto the
chromosome.
Thus the connection (sign and weight) between
neuron "i" and neuron "j" will be expressed by
the (i*N + j)th group of (B+1) bits. These
(B+1) bits on the chromosome are called a
"slot". An initial population of randomly
generated chromosomes of this length is used
to evolve the required GenNets. We turn now
to some initial results. FIG.1 shows the fitness
rise as a function of the number of generations,
for the three cases of 2, 3 and 4 input/output
pairs, as listed earlier. In the 2 and 3 pair
cases, the final fitness values stabilized at
values thought to be "acceptable", where
"acceptable" was defined to mean that if the
target output was 0.0, then any output less than
0.2 was passable, and if the target output was
0.7, then any output greater than 0.4 was
passable. The output values at the 100th cycle
of the elite chromosomes are shown in FIG.2.
The 4 input/output pair case shows that for a
12-Neuron GenNet, 4 pairs are too many.
Note that for each of these 3 graphs the 2, 3
and 4 I/O pair case GenNets were evolved
from scratch, i.e. there was no incremental
evolution used. A 16-Neuron GenNet was then
evolved with 4 and 5 I/O pairs respectively.
The 4 I/O pair case evolved "acceptably", but
the 5 I/O pair case did not. See FIG.3 for the
fitness growths of the 4 and 5 I/O pair cases.
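The chromosome encoding described earlier in this section (one sign bit plus B magnitude bits per slot, N*N slots per chromosome) can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, chromosomes are represented as strings of "0"/"1" characters for readability, and counting slots from zero is an assumption.

```python
def decode_slot(bits):
    """Decode one slot: a sign bit followed by B bits read as a binary fraction."""
    sign = -1.0 if bits[0] == "1" else 1.0   # 1 = inhibitory, 0 = excitatory
    magnitude = sum(int(b) * 2.0 ** -(k + 1) for k, b in enumerate(bits[1:]))
    return sign * magnitude


def chromosome_length(n_neurons, b_bits):
    """Number of bits needed for all N*N signed weights: N*N*(B+1)."""
    return n_neurons * n_neurons * (b_bits + 1)


def slot_bits(chromosome, i, j, n_neurons, b_bits):
    """Return the (i*N + j)th slot, counting slots from zero (an assumption)."""
    width = b_bits + 1
    start = (i * n_neurons + j) * width
    return chromosome[start:start + width]
```

With B = 6, `decode_slot("1110100")` reproduces the weight -0.8125 quoted in the text, and a 12-neuron GenNet needs `chromosome_length(12, 6)` = 1008 bits.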
FIG.2 OUTPUT VALUES
for 12-Neuron GenNets (2, 3, 4 PAIRS)


FIG.4 shows the output values at the 100th
cycle of the elite chromosomes for the two
cases. The 5 I/O pair case shows that for a
16-Neuron GenNet, 5 pairs were too many. Note
that for each of these 2 graphs the 4 and 5 I/O
pair case GenNets were evolved from scratch,
i.e. there was no incremental evolution used.

The chromosome population which resulted
from the evolution of the 12-Neuron, 3 I/O pair
case, was then "inserted" into the initial
population in an experiment to evolve
(incrementally) a 16-Neuron, 4 I/O pair case.
However, these chromosomes needed to be
lengthened (for the 16-Neuron GenNet) in order
to specify the signs and weights of the
connections with the 4 extra neurons. These
signs and weights (i.e. slots) of the
connections between the original 12 and the
extra 4 neurons were concatenated onto the
original chromosomes.


FIG.3 FITNESS EVOLUTIONS
for 16-Neuron GenNets (4, 5 PAIRS)


FIG.4 OUTPUT VALUES
for 16-Neuron GenNets (4, 5 PAIRS)


FIG.5 shows how this was done. In any
region, slots were read from left to right and
then from top to bottom. This representation of
the chromosome was chosen for two rather
obvious reasons. The first reason was that
adding further neurons (and hence extending
the length of the chromosome) would not
change the interpretation of the information
contained in the earlier part of the chromosome
(e.g. Region "A" is interpreted in the same
way, i.e. codes for the connection weights and
signs between the original 12 neurons, even
when 4 more neurons are added).

The second reason was that this representation
also allows incremental evolution, i.e. one can
"load" or insert a smaller (earlier) chromosome
(which results from an earlier phase of
evolution) into a later, larger chromosome for a
second phase of evolution. For example, in
FIG. 5 one could load the first 12*12 slots
(which resulted from the evolution of a
12-Neuron GenNet) into Region "A" of the 16*16
slots of the chromosome representing a
16-Neuron GenNet.

Thus, to perform the incremental evolution of
this experiment, the 12*12 (Region "A")
weight matrix which resulted from the
evolution of the chromosome of the 12-Neuron
GenNet with 3 I/O pairs, was loaded into
Region "A" of the chromosomes for a
16-Neuron GenNet. Actually an initial population
of chromosomes for a 16-Neuron GenNet was
randomly generated, followed by the
overwriting of the Region "A" of each
chromosome of that 16-Neuron GenNet
population by the 12-Neuron GenNet
chromosome population.
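The seeding step just described (generate a random 16-Neuron population, then overwrite Region "A" of each chromosome with an evolved 12-Neuron chromosome) can be sketched as below. This is a hedged illustration rather than the author's implementation: the function names are hypothetical, chromosomes are strings of "0"/"1" characters, Region "A" is assumed to occupy the leading 12*12 slots of the longer chromosome (consistent with the extra slots being concatenated onto the original chromosomes), and both populations are assumed to have the same size.

```python
import random


def random_chromosome(n_slots, slot_width, rng):
    """A random bit string covering n_slots slots of slot_width bits each."""
    return "".join(rng.choice("01") for _ in range(n_slots * slot_width))


def seed_incremented_population(old_population, n_old, n_new, b_bits, rng):
    """Build an n_new-neuron population whose Region "A" holds the old chromosomes."""
    width = b_bits + 1
    region_a_bits = n_old * n_old * width      # leading bits reserved for Region "A"
    total_slots = n_new * n_new
    seeded = []
    for old in old_population:
        if len(old) != region_a_bits:
            raise ValueError("old chromosome has unexpected length")
        fresh = random_chromosome(total_slots, width, rng)
        # Overwrite Region "A" with the evolved bits; keep the random remainder.
        seeded.append(old + fresh[region_a_bits:])
    return seeded
```

For the experiment in the text one would call it with `n_old=12`, `n_new=16`, so that only the slots involving the 4 new neurons start from random values.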

One was then curious to see if the evolution of
this "incremented" 16-Neuron GenNet using 4
I/O pairs would occur at all, and if so, would it
occur faster than the evolution from scratch of
the 16-Neuron GenNet with 4 I/O pairs, as
shown in FIG. 3. Intuitively, one feels that
since part of the search space of the
incremented 16-Neuron GenNet has already
been searched, i.e. the "portion" of the search
space of the 12-Neuron GenNet that is inserted
into the 16-Neuron GenNet chromosomes,
then the evolution of the remaining search
space which is added with the addition of the 4
new neurons, would be quicker than having to
evolve the whole search space corresponding
to a 16-Neuron GenNet from scratch.

Whether the final stabilized fitness value of the
elite incremented chromosome would be the
same as that for the evolution-from-scratch
case is difficult to predict. FIG. 6 shows the
results. The incremented GenNet obviously
evolved, which is the first and most critical
result. Thus a GenNet can be incrementally
GPed. Secondly, it evolved more quickly than
the "from scratch" case (whose fitness curve is
copied over from FIG. 3) for the first 500
generations or so. Thirdly, the fitness value
stabilized at a value noticeably lower than the
"from scratch" case. This was thought to be
significant. It will be interesting to see whether
other researchers, when employing incremental
evolutionary techniques to other applications,
observe the same "better sooner, worse later"
phenomenon. If so, then this new "effect"
might be worthy of being given a name.
FIG.5 INCREMENTED WEIGHT MATRIX

When one compares the time taken to evolve a
12-Neuron, 3 I/O pair GenNet (T12,3p), plus
the time taken to incrementally evolve (from
this 12-Neuron GenNet) a 16-Neuron, 3+1 I/O
pair GenNet (T12+4,3+1p), and compares this
sum (T12,3p + T12+4,3+1p) with the time
taken to evolve a 16-Neuron, 4 I/O pair
GenNet from scratch (T16,4p), one observes
from FIGs. 1 and 6, that approximately :-

T12,3p = 200 generations (fitness > 1900.0),
T12+4,3+1p = 350 generations (fitness > 1900.0),
T16,4p = 500 generations (fitness > 1900.0)

Therefore, T12,3p + T12+4,3+1p and T16,4p
are of comparable size (i.e. 550 and 500), i.e.
it takes about the same time to perform a two
step incremental evolution, as to perform the
same size evolution from scratch. BUT, note
that T12+4,3+1p is quicker than T16,4p.
These comparative times are debatable, because
they depend upon the fitness levels chosen
which define when evolutions are "completed".
3. Conclusions and Future Work

One can draw some tentative conclusions from
the above work. Firstly, and most critically, in
this one case at least, incremental evolution
worked (but the final quality seemed to be
lower than a from-scratch approach). A lot
more work needs to be done by other
researchers with other examples to check if
these initial results obtained by the author are
generally true. As a sideline, it also appeared
that the total time taken to perform the initial
evolution plus the incremented evolution
(T12,3p + T12+4,3+1p) was roughly equal to
the from-scratch evolution (T16,4p). So does
this mean that future research teams using GP
methods to build their artificial nervous
systems have a choice of two approaches, i.e.
either to increment or rebuild? The question
remains open. However, GIVEN that one
already has an evolved GenNet, it is quicker to
evolve it incrementally than to start over from
scratch (i.e. T12+4,3+1p < T16,4p).
Hopefully, this result will prove to be general,
and will apply to artificial nervous systems.

At  the  time  of  writing,  the  author  is  attempting 
to  GP  artificial  nervous  systems  using  cellular 
automata  (CA)  as  a  base.  The  idea  is  to  grow 
CA  trails  (3  cells  wide),  by  sending  4  types  of 
"signals"  down  the  middle  of  the  trail.  When 
the  signals  hit  the  end  of  the  trail,  four  types  of 
action  can  occur,  depending  upon  the  type  (i.e. 
the  state  or  color)  of  the  signal  (red  =  turn  trail 
left,  green  =  turn  trail  right,  brown  =  extend 
trail  one  cell,  purple  =  split  trail  into  a  T 
intersection).  The  sequence  of  these  signals 
corresponds  to  a  chromosome  in  a  GA,  and 
hence  can  be  evolved.  There  is  thus  a  mapping 
between  the  sequence  and  a  CA  network. 
When  two  trails  collide,  a  "synapse"  is  formed 
which  absorbs  oncoming  signals. 
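To make the idea of a signal sequence acting as a chromosome more concrete, here is a deliberately simplified Python sketch that grows trail centre-lines on a grid from a list of signal colors. It is not the CREEPER program and does not model the cellular automaton itself: the 3-cell trail width, signal propagation, and synapse formation on collision are all ignored, and the rule that every signal is applied to every open trail end after a split is purely an assumption made for illustration. All names are hypothetical.

```python
# Headings are (dx, dy) unit steps; a left turn rotates the heading by 90 degrees.
LEFT = {(1, 0): (0, 1), (0, 1): (-1, 0), (-1, 0): (0, -1), (0, -1): (1, 0)}
RIGHT = {v: k for k, v in LEFT.items()}


def grow_trails(signals, start=(0, 0), heading=(1, 0)):
    """Return the set of grid cells covered by the trail centre-lines."""
    heads = [(start, heading)]   # open trail ends as (position, heading)
    cells = {start}
    for signal in signals:
        next_heads = []
        for (x, y), h in heads:
            if signal == "red":        # turn trail left
                next_heads.append(((x, y), LEFT[h]))
            elif signal == "green":    # turn trail right
                next_heads.append(((x, y), RIGHT[h]))
            elif signal == "brown":    # extend trail one cell
                nxt = (x + h[0], y + h[1])
                cells.add(nxt)
                next_heads.append((nxt, h))
            elif signal == "purple":   # split trail into a T intersection
                next_heads.append(((x, y), LEFT[h]))
                next_heads.append(((x, y), RIGHT[h]))
            else:
                raise ValueError("unknown signal: " + str(signal))
        heads = next_heads
    return cells
```

A GA would then operate on the `signals` list itself, with each candidate list decoded into a network layout by a function of this kind.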

In the first of a two phase process, the trails are
laid down. In the second phase, the CA
network uses a second set of CA state
transition rules to make the network behave
like a neural network, whose fitness at
performing some desired task can be
measured. To make the CA network behave
like a neural network, CA state transition rules
need to be defined which make a CA system
function like a neuron (with addition of
dendritic signal strengths, and axonal output
conversion). Axon signals remain at constant
strength, but at a synapse, dendritic signal
strengths drop off as a function of the distance
from the synapse. Thus the greater the distance
between the axon/dendrite synapse and the
neuron, the weaker the signal arriving at the
neuron. Hence the distances correspond to the
weights in conventional neural network
formulations. But the distances can be evolved
in the first phase. Hence, using a GPed
"neurite network" (a neurite is a baby neuron
which grows) based on CAs, it is possible to
both grow and evolve neural nets. The author's
program based on these ideas is called
"CREEPER". If these GPed "neurite nets" prove
to be evolvable (i.e. their fitnesses improve
over time), then the ideas will be extended
from 2 to 3 dimensions, and hopefully later to
a hardware implementation, using CAMs (i.e.
CA Machines). By having a population of these
CAMs, plus local micro processors to measure
the fitnesses and to control the GA aspects of
the evolution, we have the beginnings of a
Darwin Machine design. The Brain Builder
Group hopes to evolve large numbers of neural
modules (GenNets, i.e. GPed neural nets) and
connections between them, using such Darwin
Machines. The artificial nervous system which
results will be destined to control a robot kitten
with about 100 "behaviors". It will take
several man years of work to build such a biot
(biological robot), so the group will be highly
motivated not to have to start from scratch each
time we want to increase the biot's capabilities.
Hence again, the question raised by this paper,
"How to GP incrementally", is emphasized.
Abstract

This paper presents a self-improving reactive
control system for autonomous robotic
navigation. The navigation module uses a
schema-based reactive control system to perform the
navigation task. The learning module combines
case-based reasoning and reinforcement
learning to continuously tune the navigation system
through experience. The case-based reasoning
component perceives and characterizes the
system's environment, retrieves an appropriate
case, and uses the recommendations of the case
to tune the parameters of the reactive control
system. The reinforcement learning component
refines the content of the cases based on the
current experience. Together, the learning
components perform on-line adaptation, resulting in
improved performance as the reactive control
system tunes itself to the environment, as well as
on-line learning, resulting in an improved library
of cases that capture environmental regularities
necessary to perform on-line adaptation. The
system is extensively evaluated through
simulation studies using several performance metrics
and system configurations.

Keywords: Robot navigation, reactive control,
case-based reasoning, reinforcement learning,
adaptive control.

1  Introduction 

Autonomous robotic navigation is defined as the
task of finding a path along which a robot can
move safely from a source point to a destination
point in an obstacle-ridden terrain, and executing
the actions to carry out the movement in a real
or simulated world. Several methods have been
proposed for this task, ranging from high-level
planning methods to reactive control methods.

High-level planning methods use extensive
world knowledge and inferences about the
environment they interact with (Fikes, Hart &
Nilsson, 1972; Sacerdoti, 1975). Knowledge about
available actions and their consequences is used
to formulate a detailed plan before the actions are
actually executed in the world. Such systems can
successfully perform the path-finding required
by the navigation task, but only if an accurate and
complete representation of the world is available
to the system. Considerable high-level
knowledge is also needed to learn from planning
experiences (e.g., Hammond, 1989a; Minton, 1988;
Mostow & Bhatnagar, 1987; Segre, 1988). Such
a representation is usually not available in
real-world environments, which are complex and
dynamic in nature. To build the necessary
representations, a fast and accurate perception
process is required to reliably map sensory inputs
to high-level representations of the world. A
second problem with high-level planning is the
large amount of processing time required,
resulting in significant slowdown and the inability to
respond immediately to unexpected situations.

Situated or reactive control methods have been
proposed as an alternative to high-level
planning methods (Arkin, 1989; Brooks, 1986;
Kaelbling, 1986; Payton, 1986). In these
methods, no planning is performed; instead, a
simple sensory representation of the environment
is used to select the next action that should be
performed. Actions are represented as simple
behaviors, which can be selected and executed
rapidly, often in real-time. These methods can
cope with unknown and dynamic environmental
configurations, but only those that lie within
the scope of predetermined behaviors.
Furthermore, such methods cannot modify or improve
their behaviors through experience, since they
do not have any predictive capability that could
account for future consequences of their actions,
nor a higher-level formalism in which to
represent and reason about the knowledge necessary
for such analysis.

We propose a self-improving navigation system
that uses reactive control for fast performance,
augmented with multistrategy learning methods
that allow the system to adapt to novel
environments and to learn from its experiences. The
system autonomously and progressively constructs
representational structures that aid the
navigation task by supplying the predictive capability
that standard reactive systems lack. The
representations are constructed using a hybrid
case-based and reinforcement learning method
without extensive high-level reasoning. The system
is very robust and can perform successfully in
(and learn from) novel environments, yet it
compares favorably with traditional reactive
methods in terms of speed and performance. A
further advantage of the method is that the system
designers do not need to foresee and represent
all the possibilities that might occur since the
system develops its own "understanding" of the
world and its actions. Through experience, the
system is able to adapt to, and perform well in,
a wide range of environments without any user
intervention or supervisory input. This is a
primary characteristic that autonomous agents must
have to interact with real-world environments.

This paper is organized as follows. Section 2
presents a technical description of the system,
including the schema-based reactive control
component, the case-based and reinforcement
learning methods, and the system-environment model
representations, and places it in the context of
related work in the area. Section 3 presents several
experiments that evaluate the system. The
results shown provide empirical validation of our
approach. Section 4 concludes with a
discussion of the lessons learned from this research
and suggests directions for future research.

2  Technical  Details 

2.1  System  Description 

The Self-Improving Navigation System (SINS)
consists of a navigation module, which uses
schema-based reactive control methods, and an
on-line adaptation and learning module, which
uses case-based reasoning and reinforcement
learning methods. The navigation module is
responsible for moving the robot through the
environment from the starting location to the desired
goal location while avoiding obstacles along the
way. The adaptation and learning module has
two responsibilities. The adaptation sub-module
performs on-line adaptation of the reactive
control parameters to get the best performance from
the navigation module. The adaptation is based
on recommendations from cases that capture and
model the interaction of the system with its
environment. With such a model, SINS is able to
predict future consequences of its actions and
act accordingly. The learning sub-module
monitors the progress of the system and
incrementally modifies the case representations through
experience. Figure 1 shows the SINS functional
architecture.

The main objective of the learning module is
to construct a model of the continuous
sensorimotor interaction of the system with its
environment, that is, a mapping from sensory
inputs to appropriate behavioral (schema)
parameters. This model allows the adaptation module
to control the behavior of the navigation module
by selecting and adapting schema parameters in
different environments. To learn a mapping in
this context is to discover environment
configurations that are relevant to the navigation task
and corresponding schema parameters that
improve the navigational performance of the
system. The learning method is unsupervised and,
unlike traditional reinforcement learning
methods, does not rely on an external reward
function (cf. Watkins, 1989; Whitehead & Ballard,
1990). Instead, the system's "reward" depends
on the similarity of the observed mapping in the
current environment to the mapping represented
in the model. This causes the system to converge
towards those mappings that are consistent over
a set of experiences.

The  representations  used  by  SINS  to  model  its 
interaction  with  the  environment  are  initially 
under-constrained  and  generic;  they  contain  very 
little  useful  information  for  the  navigation  task. 
As  the  system  interacts  with  the  environment,  the 
learning  module  gradually  modifies  the  content 
of  the  representations  until  they  become  useful 
and  provide  reliable  information  for  adapting  the 
navigation  system  to  the  particular  environment 
at  hand. 

The learning and navigation modules function
in an integrated manner. The learning module is
always trying to find a better model of the
interaction of the system with its environment so that
it can tune the navigation module to perform its
function better. The navigation module provides
feedback to the learning module so it can build
a better model of this interaction. The behavior
of the system is then the result of an
equilibrium point established between the learning
module, which is trying to refine the model, and
the environment, which is complex and dynamic
in nature. This equilibrium may shift and need to be
re-established if the environment changes
drastically; however, the model is generic enough at
any point to be able to deal with a very wide
range of environments.

We now present the reactive module, the
representations used by the system, and the methods
used by the learning module in more detail.

2.2 The Schema-Based Reactive Control Module

The  reactive  control  module  is  based  on  the 
AuRA  architecture  (Arkin,  1989),  and  consists 


of  a  set  of  motor  schemas  that  represent  the  indi¬ 
vidual  motor  behaviors  available  to  the  system. 
Each  schema  reacts  to  sensory  information  from 
the  environment,  and  produces  a  velocity  vec¬ 
tor  representing  the  direction  and  speed  at  which 
the  robot  is  to  move  given  current  environmen¬ 
tal  conditions.  The  velocity  vectors  produced  by 
all  the  schemas  are  then  combined  to  produce  a 
potential  field  that  directs  the  actual  movement 
of  the  robot.  Simple  behaviors,  such  as  wan¬ 
dering,  obstacle  avoidance,  and  goal  following, 
can combine to produce complex emergent behaviors
in a particular environment. Different
emergent behaviors can be obtained by modifying
the simple behaviors. This allows the
system  to  interact  successfully  in  different  en¬ 
vironmental  configurations  requiring  different 
navigational  “strategies”  (Clark,  Arkin,  &  Ram, 
1992). 

A  detailed  description  of  schema-based  reac¬ 
tive  control  methods  can  be  found  in  Arkin 
(1989). In this research, we used three motor
schemas: Avoid-Static-Obstacle, Move-To-Goal,
and Noise. Avoid-Static-Obstacle directs
the system to move itself away from detected
obstacles. Move-To-Goal directs
the system to move towards a particular
point in the terrain. Noise makes
the system wander in a random direction.
Each  motor  schema  has  a  set  of  parameters  that 
control  the  potential  field  generated  by  the  mo¬ 
tor  schema.  In  this  research,  we  used  the  fol¬ 
lowing  parameters:  Obstacle-Gain,  associated 
with  Avoid-Static-Obstacle,  determines  the 
magnitude  of  the  repulsive  potential  field  gener¬ 
ated  by  the  obstacles  perceived  by  the  system; 
Goal-Gain,  associated  with  Move-To-Goal, 
determines  the  magnitude  of  the  attractive  po¬ 
tential  field  generated  by  the  goal;  Noise-Gain, 
associated  with  Noise,  determines  the  magni¬ 
tude  of  the  noise;  and  Noise-Persistence,  also 
associated with Noise, determines the duration
for  which  a  noise  value  is  allowed  to  persist. 

Different combinations of schema parameters
cause the system to exhibit different behaviors
(see figure 2). Traditionally, parameters
are fixed and determined ahead of time by
the system designer. However, on-line selection
and modification of the appropriate parameters
based on the current environment can enhance
navigational performance (Clark, Arkin,
& Ram, 1992; Moorman & Ram, 1992). SINS
adopts this approach by allowing schema parameters
to be modified dynamically. However, in
those systems, the cases are supplied by the designer
as hand-coded cases. Our system,
in contrast, can learn and modify its own
cases through experience. The representation of
our cases is also considerably different and is
designed to support reinforcement learning.

Figure 2: Typical navigational behaviors of different tunings
of the reactive control module. The figure on the left
shows the non-learning system with high obstacle avoidance
and low goal attraction. On the right, the learning
system has lowered obstacle avoidance and increased goal
attraction, allowing it to “squeeze” through the obstacles
and then take a relatively direct path to the goal.

2.3 The System-Environment Model
Representation

The  navigation  module  in  SINS  can  be  adapted 
to exhibit many different behaviors. SINS improves
its performance by learning how and
when to tune the navigation module. In this
way, the system can use the appropriate behavior
in each environmental configuration encountered.
The learning module, therefore, must
learn  about  and  discriminate  between  different 
environments,  and  associate  with  each  the  ap¬ 
propriate  adaptations  to  be  performed  on  the 
motor  schemas.  This  requires  a  representational 
scheme  to  model,  not  just  the  environment,  but 
the  interaction  between  the  system  and  the  en¬ 
vironment.  However,  to  ensure  that  the  system 
does  not  get  bogged  down  in  extensive  high- 
level  reasoning,  the  knowledge  represented  in 
the  model  must  be  based  on  perceptual  and  mo¬ 
tor  information  easily  available  at  the  reactive 
level. 


Figure 3: Sample representations showing the time history
of analog values representing perceived inputs and
schema parameters. Each graph in the case (below) is
matched against the corresponding graph in the current
environment (above) to determine the best match, after
which the remaining part of the case is used to guide navigation
(shown as dashed lines).


SINS  uses  a  model  consisting  of  associations 
between the sensory inputs and schema parameter
values. Each set of associations is represented
as a case. Sensory inputs provide
information  about  the  configuration  of  the  en¬ 
vironment,  and  schema  parameter  information 
specifies  how  to  adapt  the  navigation  module  in 
the  environments  to  which  the  case  is  applica¬ 
ble.  Each  type  of  information  is  represented  as 
a  vector  of  analog  values.  Each  analog  value 
corresponds  to  a  quantitative  variable  (a  sensory 
input  or  a  schema  parameter)  at  a  specific  time. 
A  vector  represents  the  trend  or  recent  history 
of  a  variable.  A  case  models  an  association  be¬ 
tween  sensory  inputs  and  schema  parameters  by 
grouping their respective vectors together. Figure
3 shows an example of this representation.

This representation has three essential properties.
First, the representation is capable of capturing
a wide range of possible associations between
sensory inputs and schema parameters.
Second, it permits continuous progressive refinement
of the associations. Finally, the representation
captures trends or patterns of input and
output values over time. This allows the system
to detect patterns over larger time windows
rather  than  having  to  make  a  decision  based  only 
on  instantaneous  values  of  perceptual  inputs. 

In this research, we used four input vectors
to characterize the environment and discriminate
among different environment configurations:
Obstacle-Density provides a measure
of  the  occupied  areas  that  impede  navigation; 
Absolute-Motion  measures  the  activity  of  the 
system;  Relative-Motion  represents  the  change 
in  motion  activity;  and  Motion-Towards-Goal 
specifies  how  much  progress  the  system  has  ac¬ 
tually  made  towards  the  goal.  These  input  vec¬ 
tors  are  constantly  updated  with  the  information 
received  from  the  sensors. 

We  also  used  four  output  vectors  to  represent 
the  schema  parameter  values  used  to  adapt  the 
navigation  module,  one  for  each  of  the  schema 
parameters (Obstacle-Gain, Goal-Gain, Noise-Gain,
and Noise-Persistence) discussed earlier.
The  values  are  set  periodically  according  to  the 
recommendations  of  the  case  that  best  matches 
the  current  environment.  The  new  values  remain 
constant  until  the  next  setting  period. 
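
One possible in-memory layout for this representation is sketched below. It assumes that a case stores its four input and four output vectors as rows of two equally long arrays, and that the current environment is kept as a sliding window of recent readings. The class and attribute names are hypothetical; only the eight vector names come from the text.

```python
from collections import deque
from dataclasses import dataclass

import numpy as np

INPUTS = ("Obstacle-Density", "Absolute-Motion",
          "Relative-Motion", "Motion-Towards-Goal")
OUTPUTS = ("Obstacle-Gain", "Goal-Gain", "Noise-Gain", "Noise-Persistence")

@dataclass
class Case:
    """Association between input trends and schema parameter trends."""
    inputs: np.ndarray   # shape (4, case_length), one row per name in INPUTS
    outputs: np.ndarray  # shape (4, case_length), one row per name in OUTPUTS

    def __post_init__(self):
        self.inputs = np.asarray(self.inputs, dtype=float)
        self.outputs = np.asarray(self.outputs, dtype=float)
        if (self.inputs.shape[0] != len(INPUTS)
                or self.outputs.shape[0] != len(OUTPUTS)):
            raise ValueError("expected four input rows and four output rows")
        if self.inputs.shape[1] != self.outputs.shape[1]:
            raise ValueError("input and output vectors must share one length")

class EnvironmentWindow:
    """Keeps the last `length` readings of every input as the current trend."""

    def __init__(self, length):
        self.history = deque(maxlen=length)

    def push(self, reading):
        """Append one reading: a dict mapping each name in INPUTS to a number."""
        self.history.append([float(reading[name]) for name in INPUTS])

    def vectors(self):
        """Return an array of shape (4, n) with the oldest sample first."""
        if not self.history:
            return np.empty((len(INPUTS), 0))
        return np.array(list(self.history), dtype=float).T
```

Storing each variable as a row keeps the time axis contiguous, which makes it straightforward to compare a slice of the environment window against a case.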

The  choice  of  input  and  output  vectors  was  based 
on  the  complexity  of  their  calculation  and  their 
relevance  to  the  navigation  task.  The  input  vec¬ 
tors  were  chosen  to  represent  environment  con¬ 
figurations  in  a  generic  manner  but  taking  into 
account  the  processing  required  to  produce  those 
vectors  (e.g.,  obstacle  density  is  more  generic 
than  obstacle  position,  and  can  be  obtained  eas¬ 
ily  from  the  robot’s  ultrasonic  sensors).  The 
output  vectors  were  chosen  to  represent  directly 
the  actions  that  the  learning  module  uses  to  tune 
the  navigation  module,  that  is,  the  schema  pa¬ 
rameter  values  themselves. 

2.4  The  On-Line  Adaptation  And  Learning 
Module 

This  module  creates,  maintains  and  applies  the 
case  representations  used  for  on-line  adapta¬ 
tion  of  the  reactive  module.  The  objective  of 
the  learning  method  is  to  detect  and  discrim¬ 
inate  among  different  environment  configura¬ 
tions,  and  to  identify  the  appropriate  schema 
parameter  values  to  be  used  by  the  navigation 
module, in a dynamic, on-line manner.
This means that, as the system is navigating,
the learning module is perceiving the environment,
detecting an environment configuration,
and  modifying  the  schema  parameters  of  the 
navigation  module  accordingly,  while  simulta¬ 
neously  updating  its  own  cases  to  reflect  the  ob¬ 
served  results  of  the  system’s  actions  in  various 
situations. 

The  method  is  based  on  a  combination  of  ideas 
from  case-based  reasoning  and  learning,  which 
deals  with  the  issue  of  using  past  experiences  to 
deal with and learn from novel situations (e.g.,
see Kolodner, 1988; Hammond, 1989b), and
from reinforcement learning, which deals with
the issue of updating the content of the system’s
knowledge based on feedback from the environment
(e.g., see Sutton, 1992). However, in traditional
case-based planning systems (e.g., Hammond,
1989a) learning and adaptation require a
detailed  model  of  the  domain.  This  is  exactly 
what  reactive  planning  systems  are  trying  to 
avoid.  Earlier  attempts  to  combine  reactive  con¬ 
trol  with  classical  planning  systems  (e.g.,  Chien, 
Gervasio,  &  DeJong,  1991)  or  explanation- 
based  learning  systems  (e.g.,  Mitchell,  1990) 
also  relied  on  deep  reasoning  and  were  typically 
too  slow  for  the  fast,  reflexive  behavior  required 
in  reactive  control  systems.  Unlike  these  ap¬ 
proaches,  our  method  does  not  fall  back  on  slow 
non-reactive  techniques  for  improving  reactive 
control. 

To  effectively  improve  the  performance  of  the 
navigation  task,  the  learning  module  must  find  a 
consistent  mapping  from  environment  configu¬ 
rations  to  control  parameters.  The  learning  mod¬ 
ule  captures  this  mapping  in  the  learned  cases, 
each  case  representing  a  portion  of  the  map¬ 
ping  localized  in  a  specific  environment  con¬ 
figuration.  The  set  of  cases  represents  the  sys¬ 
tem’s  model  of  its  interactions  with  the  envi¬ 
ronment,  which  is  adapted  through  experience 
using  the  case-based  and  reinforcement  learn¬ 
ing  methods.  The  case-based  method  selects  the 
case  best  suited  for  a  particular  environment  con¬ 
figuration.  The  reinforcement  learning  method 
updates  the  content  of  a  case  to  reflect  the  cur¬ 
rent  experience,  such  that  those  aspects  of  the 
mapping  that  are  consistent  over  time  tend  to  be 
reinforced. Since the navigation module implicitly
provides the bias to move to the goal while
avoiding obstacles, mappings that are consistently
observed are those that tend to produce
this  behavior.  As  the  system  gains  experience, 
therefore,  it  improves  its  own  performance  at  the 
navigation  task. 

Each  case  represents  an  observed  regularity  be¬ 
tween  a  particular  environmental  configuration 
and  the  effects  of  different  actions,  and  pre¬ 
scribes  the  values  of  the  schema  parameters  that 
are  most  appropriate  (as  far  as  the  system  knows 
based  on  its  previous  experience)  for  that  en¬ 
vironment.  The  learning  module  performs  the 
following  tasks  in  a  cyclic  manner:  (1)  perceive 
and  represent  the  current  environment;  (2)  re¬ 
trieve  a  case  whose  input  vector  represents  an 
environment  most  similar  to  the  current  envi¬ 
ronment;  (3)  adapt  the  schema  parameter  val¬ 
ues  in  use  by  the  reactive  control  module  by  in¬ 
stalling  the  values  recommended  by  the  output 
vectors  of  the  case;  and  (4)  learn  new  associ¬ 
ations  and/or  adapt  existing  associations  repre¬ 
sented  in  the  case  to  reflect  any  new  information 
gained through the use of the case in the new
situation  to  enhance  the  reliability  of  their  pre¬ 
dictions. 
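
The four tasks listed above form a simple loop, which can be sketched as below. The four callables are hypothetical hooks supplied by the caller; the sketch fixes only the order of the steps and the data passed between them, not how any step is implemented.

```python
def run_learning_cycle(perceive, retrieve, adapt, learn, n_cycles):
    """Run the perceive -> retrieve -> adapt -> learn loop n_cycles times."""
    for _ in range(n_cycles):
        environment = perceive()              # (1) represent the current environment
        case, match = retrieve(environment)   # (2) most similar stored case
        adapt(case, match)                    # (3) install recommended schema parameters
        learn(case, environment, match)       # (4) refine the case or create a new one
```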

A  detailed  description  of  each  step  would  re¬ 
quire  more  space  than  is  available  in  this  paper; 
however, a short description of the method follows.
The perceive step builds a set of four input
vectors $E_{input_j}$, one for each sensory input $j$ described
earlier, which are matched against the
corresponding input vectors $C^k_{input_j}$ of the cases
in the system’s memory in the retrieve step. The
case similarity metric $SM$ is based on the mean
squared difference between each of the vector
values $C^k_{input_j}(i)$ of the $k$th case over a trending
window $l_C$, and the vector values $E_{input_j}(i)$
of the environment $E$ over a trending window of
a given length $l_E$:

$$SM(E, C^k, p) = \sum_{j=1}^{4} \sum_{i=0}^{\min(l_E - p,\, l_C)} \frac{\left(E_{input_j}(i + p) - C^k_{input_j}(i)\right)^2}{\min(l_E - p,\, l_C)}$$

The match window is calculated using a
reverse sweep over the time axis $p$ similar to a
convolution process to find the relative position
(represented by $\min(l_E - p, l_C)$) that matches
best. The best matching case $C^{k_{best}}$, satisfying
the equation:

$$\{k_{best}, p_{best} \mid \min(SM(E, C^k, p)),\ \forall k,\ 0 \le p \le l_E\}$$

is handed to the adapt step, which selects the
schema parameter values $C^{k_{best}}_{output_j}$ from the output
vectors of the case and modifies the values
currently in use using a reinforcement formula
which uses the case similarity metric as a scalar
reward. Thus the actual adaptations performed
depend on the goodness of match between the
case and the environment, and are given by:

$$\min(l_E - p_{best},\, l_C) \times |1 - RSM| \, \mathrm{random}(0, \max C^{k_{best}}_{output_j})$$

where $RSM$ is the relative similarity metric discussed
below. The random factor allows the
system  to  “explore”  the  search  space  locally  in 
order  to  discover  regularities,  since  the  system 
does  not  start  with  prior  knowledge  that  can  be 
used  to  guide  this  search. 
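
The retrieve step can be illustrated with a short sketch. It assumes that the similarity metric is the squared difference summed over the input vectors and divided by the overlap length, and that every offset leaving at least one overlapping sample is tried. Both are assumptions made for the example, as are the function names. Lower values mean a better match.

```python
import numpy as np

def similarity(env, case_inputs, p):
    """Mean squared difference between the environment and a case at offset p.

    env has shape (n_inputs, l_E) and case_inputs has shape (n_inputs, l_C).
    The overlap n = min(l_E - p, l_C) is the number of samples compared.
    """
    l_e, l_c = env.shape[1], case_inputs.shape[1]
    n = min(l_e - p, l_c)
    if n <= 0:
        raise ValueError("offset p leaves no overlap")
    diff = env[:, p:p + n] - case_inputs[:, :n]
    return float(np.sum(diff ** 2) / n)

def retrieve(env, cases):
    """Return (k_best, p_best, sm_best) over all cases and offsets, or None."""
    best = None
    for k, case_inputs in enumerate(cases):
        for p in range(env.shape[1]):  # offsets with at least one shared sample
            sm = similarity(env, case_inputs, p)
            if best is None or sm < best[2]:
                best = (k, p, sm)
    return best
```

The sweep direction does not change which pair attains the minimum; it only affects which pair is returned when several offsets tie.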

Finally,  the  learn  step  uses  statistical  informa¬ 
tion  about  prior  applications  of  the  case  to  de¬ 
termine  whether  information  from  the  current 
application  of  the  case  should  be  used  to  mod¬ 
ify  this  case,  or  whether  a  new  case  should 
be  created.  The  vectors  encoded  in  the  cases 
are  adapted  using  a  reinforcement  formula  in 
which  a  relative  similarity  measure  is  used  as 
a  scalar  reward  or  reinforcement  signal.  The 
relative  similarity  measure  RSM,  given  by 
{SM  -  SM^)/{SM  —  SMi^)  quantifies  how 
similar  the  current  environment  configuration  is 
to  the  environment  configuration  encoded  by  the 
case  relative  to  how  similar  the  environment  has 
been in previous utilizations of the case. Intuitively,
if a case matches the current situation
better  than  previous  situations  it  was  used  in,  it 
is  likely  that  the  situation  involves  the  very  reg¬ 
ularities  that  the  case  is  beginning  to  capture; 
thus,  it  is  worthwhile  modifying  the  case  in  the 
direction  of  the  current  situation.  Alternatively, 
if  the  match  is  not  quite  as  good,  the  case  should 
not  be  modified  because  that  will  take  it  away 
from  the  regularity  it  was  converging  towards. 
Finally,  if  the  current  situation  is  a  very  bad  fit 
to  the  case,  it  makes  more  sense  to  create  a  new 
case  to  represent  what  is  probably  a  new  class  of 
situations. 

Thus, if the RSM is below a certain threshold
(0.1 in this paper), the input and output case vectors
are updated using a gradient descent formula
based on the similarity measure:

$$C^{k_{best}}_j(i) = \alpha \min(l_E - p,\, l_C)\left(E_j(i + p) - C^k_j(i)\right), \quad 0 \le i \le l_C$$

where the constant $\alpha$ determines the learning rate
(0.5 in this paper). In the adapt and learn steps,
the overlap factor $\min(l_E - p_{best}, l_C)$ is used to
attenuate the modification of early values within
the case which contribute more to the selection
of the current case.
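
The three-way decision made by the learn step can be sketched as a small function. The update threshold of 0.1 comes from the text. The `new_case_threshold` value, the rule that a new case is created only while fewer than `max_cases` exist, and the function name are assumptions for illustration, since the text does not state how a "very bad fit" is detected.

```python
def learning_action(rsm, n_cases, max_cases,
                    update_threshold=0.1, new_case_threshold=0.9):
    """Choose what the learn step does with the case that was just used.

    Returns "update" when the match is better than in earlier uses,
    "create" when the fit is very bad and there is room for a new case,
    and "keep" otherwise (the case is left unchanged).
    """
    if rsm < update_threshold:
        return "update"
    if rsm > new_case_threshold and n_cases < max_cases:
        return "create"
    return "keep"
```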

Since  the  reinforcement  formula  is  based  on  a 
relative  similarity  measure,  the  overall  effect 
of  the  learning  process  is  to  cause  the  cases  to 
converge  on  stable  associations  between  envi¬ 
ronment  configurations  and  schema  parameters. 
Stable  associations  represent  regularities  in  the 
world  that  have  been  identified  by  the  system 
through  its  experience,  and  provide  the  predic¬ 
tive  power  necessary  to  navigate  in  future  situ¬ 
ations.  The  assumption  behind  this  method  is 
that  the  interaction  between  the  system  and  the 
environment  can  be  characterized  by  a  finite  set 
of  causal  patterns  or  associations  between  the 
sensory  inputs  and  the  actions  performed  by  the 
system. The method allows the system to learn
these  causal  patterns  and  to  use  them  to  modify 
its  actions  by  updating  its  schema  parameters  as 
appropriate. 

Genetic  algorithms  may  also  be  used  to  mod¬ 
ify  schema  parameters  in  a  given  environment 
(Pearce, Arkin, & Ram, 1992). However, while
this approach is useful in the initial design of
the navigation system, it cannot change schema
parameters during navigation when the system
faces environments that are significantly different
from the environments used in the training
phase of the genetic algorithm. Another approach
to self-organizing adaptive control is that
of Verschure, Kröse, & Pfeifer (1992), in which
a neural network is used to learn how to associate
conditional stimuli with unconditional responses.
Although their system and ours are both self-improving
navigation systems, there is a fundamental
difference in how the performance of the
navigation  task  is  improved.  Their  system  im¬ 
proves  its  navigation  performance  by  learning 
how  to  incorporate  new  input  data  (i.e.,  condi¬ 
tional  stimulus)  into  an  already  working  naviga¬ 
tion  system,  while  SINS  improves  its  navigation 
performance  by  learning  how  to  adapt  the  system 
itself  (i.e.,  the  navigation  module).  Our  system 
does not rely on new sensory input, but on patterns
or regularities detected in the perceived environment.
Our learning methods are also similar
to  Sutton  (1990),  whose  system  uses  a  trial-and- 
error  reinforcement  learning  strategy  to  develop 
a  world  model  and  to  plan  optimal  routes  using 
the  evolving  world  model.  Unlike  this  system, 
however,  SINS  does  not  need  to  be  trained  on 
the  same  world  many  times,  nor  are  the  results  of 
its  learning  specific  to  a  particular  world,  initial 
location,  or  destination  location. 

3  Evaluation 

The  methods  presented  above  have  been  eval¬ 
uated  using  extensive  simulations  across  a  va¬ 
riety  of  different  types  of  environment,  perfor¬ 
mance  criteria,  and  system  configurations.  The 
objective of these experiments is to measure,
qualitatively and quantitatively, the improvement in
the navigation performance of SINS (the “adaptive
system”), and to compare this performance
against a non-learning schema-based reactive
system  (the  “static  system”)  and  a  system  that 
changes  the  schema  parameter  values  randomly 
after  every  control  interval  (the  “random  sys¬ 
tem”).  Rather  than  simply  measure  the  improve¬ 
ment  in  performance  in  SINS  by  some  given 
metric  such  as  “speedup”,  we  were  interested  in 
systematically  evaluating  the  effects  of  various 
design  decisions  on  the  performance  of  the  sys¬ 
tem  across  a  variety  of  metrics  in  different  types 
of  environments.  To  achieve  this,  we  designed 
several  experiments,  which  can  be  grouped  into 
four  sets  as  discussed  below. 

3.1  Experiment  Design 

The  systems  were  tested  on  randomly  generated 
environments  consisting  of  rectangular  bounded 
worlds.  Each  environment  contains  circular  ob¬ 
stacles,  a  start  location,  and  a  destination  loca¬ 
tion,  as  shown  in  figure  2.  Figure  4  shows  an 
actual  run  of  the  static  and  adaptive  systems  on 
one  of  the  randomly  generated  worlds.  The  lo¬ 
cation,  number  and  radius  of  the  obstacles  were 
randomly  determined  to  create  environments  of 
varying  amounts  of  clutter,  defined  as  the  ra¬ 
tio  of  free  space  to  occupied  space.  We  tested 
the  effect  of  three  different  parameters  in  the 
SINS system: max-cases, the maximum number
of cases that SINS is allowed to create; case-length,
$l_C$, representing the time window of a
case; and control-interval, which determines
how  often  the  schema  parameters  in  the  reactive 
control  module  are  adapted. 
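
A world of a given clutter level could be generated along the following lines. This sketch treats clutter as the fraction of the world's area covered by obstacles (matching the 25% and 50% figures used in the experiments), keeps obstacles non-overlapping and fully inside the world so that their areas can simply be summed, and omits the start and destination locations. The radius range, the attempt limit, and the function name are hypothetical.

```python
import math
import random

def generate_world(width, height, clutter, min_radius=2.0, max_radius=6.0,
                   seed=None, max_attempts=100_000):
    """Place random circular obstacles until `clutter` of the area is covered.

    Returns (obstacles, achieved_clutter), where obstacles is a list of
    (x, y, radius) tuples. Candidates that overlap an existing obstacle
    are rejected.
    """
    rng = random.Random(seed)
    target_area = clutter * width * height
    obstacles, occupied = [], 0.0
    for _ in range(max_attempts):
        if occupied >= target_area:
            break
        r = rng.uniform(min_radius, max_radius)
        x = rng.uniform(r, width - r)
        y = rng.uniform(r, height - r)
        if all(math.hypot(x - ox, y - oy) >= r + orad
               for ox, oy, orad in obstacles):
            obstacles.append((x, y, r))
            occupied += math.pi * r * r
    return obstacles, occupied / (width * height)
```

Because overlapping candidates are rejected, dense targets may not be reached within the attempt limit; the second return value reports the clutter actually achieved.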

We  used  six  estimators  to  evaluate  the  navigation 
performance  of  the  systems.  These  metrics  were 
computed  using  a  cumulative  average  over  the 
test  worlds  to  factor  out  the  intrinsic  differences 
in difficulty of different worlds. Average number
of worlds solved indicates in how many of the
worlds posed the system actually found a path to
the goal location. The optimum value is 100%
since this would indicate that every world presented
was successfully solved. Average steps
indicates the average number of steps that the
robot takes to terminate each world; smaller values
indicate better performance. Average distance
indicates the total distance traveled per
world on average; again, smaller values indicate
better performance. Average actual/optimal distance
per world indicates the ratio of the total distance
traveled  and  the  Euclidean  distance  between  the 
start  and  end  points,  averaged  over  the  solved 
worlds.  The  optimal  value  is  1,  but  this  is  only 
possible  in  a  world  without  obstacles.  Average 
virtual  collisions  per  world  indicates  the  total 
number  of  times  the  robot  came  within  a  pre¬ 
defined  distance  of  an  obstacle.  Finally,  average 
time  per  world  indicates  the  total  time  the  system 
takes  to  execute  a  world  on  average. 
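
The six estimators can be computed from per-world records as sketched below. The record fields and function names are hypothetical. The distance ratio is averaged over solved worlds only, as described above, and the cumulative variant recomputes every estimator over the worlds seen so far.

```python
from dataclasses import dataclass

@dataclass
class WorldResult:
    solved: bool             # whether the goal was reached
    steps: int               # steps taken before termination
    distance: float          # total distance traveled
    straight_line: float     # Euclidean distance from start to goal
    virtual_collisions: int  # times the robot came too close to an obstacle
    time: float              # execution time for the world

def estimators(results):
    """Average the six performance estimators over a non-empty list of worlds."""
    n = len(results)
    solved = [r for r in results if r.solved]
    ratio = (sum(r.distance / r.straight_line for r in solved) / len(solved)
             if solved else float("nan"))
    return {
        "percent_solved": 100.0 * len(solved) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        "avg_distance": sum(r.distance for r in results) / n,
        "avg_distance_ratio": ratio,
        "avg_virtual_collisions": sum(r.virtual_collisions for r in results) / n,
        "avg_time": sum(r.time for r in results) / n,
    }

def cumulative_estimators(results):
    """Estimators after each world, computed over all worlds seen so far."""
    return [estimators(results[:i]) for i in range(1, len(results) + 1)]
```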

The  data  for  the  estimators  was  obtained  after 
the  systems  terminated  each  world.  This  was 
to  ensure  that  we  were  consistently  measuring 
the  effect  of  learning  across  experiences  rather 
than  within  a  single  experience  (which  is  less 
significant on worlds of this size anyway). The
execution  is  terminated  when  the  navigation  sys¬ 
tem  reaches  its  destination  or  when  the  number 
of  steps  reaches  an  upper  limit  (3000  in  the  cur¬ 
rent  evaluation).  The  latter  condition  guarantees 
termination  since  some  worlds  are  unsolvable  by 
one  or  both  systems. 

In  this  paper,  we  discuss  the  results  from  the 
following  sets  of  experiments: 

Experiment  set  1:  Effect  of  the  multistrategy 
learning  method.  We  first  evaluated  the  effect 
of  our  multistrategy  case-based  and  reinforce¬ 
ment  learning  method  by  comparing  the  perfor¬ 
mance  of  the  SINS  system  against  the  static  and 
random systems. SINS was allowed to learn
up to 10 cases (max-cases = 10), each of case-length
= 4. Adaptation occurred every control-interval
= 4 steps. Figure 5 shows the results
obtained  for  each  estimator  over  the  200  worlds. 
Each  graph  compares  the  performance  on  one 
estimator  of  each  of  the  three  systems,  static, 
random  and  adaptive,  discussed  above. 

Experiment set 2: Effect of case parameters.
This set of experiments evaluated the effect
of two parameters of the case-based reasoning
component of the multistrategy learning system,
that is, max-cases and case-length. Control-interval
was held constant at 4, while max-cases
was set to 10, 20, 40 and 80, and case-length was
set to 4, 6, 10 and 20. All these configurations of
SINS,  and  the  static  and  random  systems,  were 
evaluated  using  all  six  estimators  on  200  ran¬ 
domly  generated  worlds  of  25%  and  50%  clutter. 
The  results  are  shown  in  figures  6  and  7. 

Experiment  set  3:  Effect  of  control  inter¬ 
val.  This  set  of  experiments  evaluated  the  ef¬ 
fect  of  the  control-interval  parameter,  which  de¬ 
termines  how  often  the  adaptation  and  learning 
module  modifies  the  schema  parameters  of  the 
reactive control module. Max-cases and case-length
were held constant at 10 and 4, respectively,
while control-interval was set to 4, 8, 12
and  16.  All  systems  were  evaluated  using  all  six 
estimators  on  200  randomly  generated  worlds  of 
50%  clutter.  The  results  are  shown  in  figure  8. 

Experiment  set  4:  Effect  of  environmental 
change.  This  set  of  experiments  was  designed 
to  evaluate  the  effect  of  changing  environmen¬ 
tal  characteristics,  and  to  evaluate  the  ability  of 
the  systems  to  adapt  to  new  environments  and 
learn  new  regularities.  With  max-cases  set  to 
10, 20, 40  and  80,  case-length  set  to  4, 6  and  10, 
and  control-interval  set  to  4,  we  presented  the 
systems  with  200  randomly  generated  worlds  of 
25%  clutter  followed  by  200  randomly  generated 
worlds  of  50%  clutter.  The  results  are  shown  in 
figure  9. 

3.2  Discussion  of  Experimental  Results 

The  results  in  figures  5  through  9  show  that  SINS 
does  indeed  perform  significantly  better  than  its 
non-learning  counterpart.  To  obtain  a  more  de¬ 
tailed  insight  into  the  nature  of  the  improvement, 
let  us  discuss  the  experimental  results  in  more 
detail.

Figure 4: Sample runs of the static and adaptive systems on a randomly generated world. The system starts at the filled
box (towards the lower right side of the world) and tries to navigate to the unfilled box. The figure on the left shows
the static system. On the right, the adaptive system has learned to “balloon” around the obstacles, temporarily moving
away from the goal, and then to “squeeze” through the obstacles (towards the end of the path) and shoot towards the
goal. The graphs at the top of the figures plot the values of the schema parameters over the duration of the run.


Experiment  set  1:  Effect  of  the  multistrategy 
learning  method.  Figure  5  shows  the  results 
obtained  for  each  estimator  over  the  200  worlds. 
As  shown  in  the  graphs,  SINS  performed  bet¬ 
ter  than  the  other  systems  with  respect  to  five 
out  of  the  six  estimators.  Figure  10  shows  the 
final  improvement  in  the  system  after  all  the 
worlds.  SINS  successfully  navigates  93%  of 
the worlds, a 541% improvement over the non-learning
system, with 22% fewer virtual collisions.
Although the non-learning system was
39%  faster,  the  paths  it  found  required  over  4 
times  as  many  steps.  On  average,  SINS’  solu¬ 
tion  paths  were  25%  shorter  and  required  76% 
fewer  steps,  an  impressive  improvement  over  a 
reactive  control  method  which  is  already  good 
at  navigation. 

The  average  time  per  world  was  the  only  esti¬ 
mator  in  which  the  self-improving  system  per¬ 
formed  worse.  The  reason  for  this  behavior  is 
that  the  case  retrieval  process  is  very  time  con¬ 
suming.  However,  since  in  the  physical  world 
the time required for physical execution of a
motor action outweighs the time required to select
the action, the time estimator is less critical
than  the  distance,  steps,  and  solved  worlds  esti¬ 
mators.  Furthermore,  as  discussed  below,  bet¬ 
ter  case  organization  methods  should  reduce  the 
time  overhead  significantly. 

The experiments also demonstrate a somewhat
unexpected  result:  the  number  of  worlds  solved 
by  the  navigation  system  is  increased  by  chang¬ 
ing  the  values  of  the  schema  parameters  even  in 
a  random  fashion,  although  the  random  changes 
lead  to  greater  distances  travelled.  This  may  be 
due  to  the  fact  that  random  changes  can  get  the 
system  out  of  “local  minima’’  situations  in  which 
the  current  settings  of  its  parameters  are  inade¬ 
quate.  However,  consistent  changes  (i.e.,  those 
that  follow  the  “regularities”  captured  by  our 
method)  lead  to  better  performance  than  random 
changes  alone. 

Figure 6:

Figure 8: Effect of control-interval.

Figure 9: Effect of a sudden change in environment (after the 200th world).

Figure 10: Final performance results.

Experiment set 2: Effect of case parameters.
All configurations of the SINS system navigated
successfully in a larger percentage of the test
worlds than the static system. Regardless of the
max-cases and case-length parameters, SINS
could  solve  most  of  the  25%  cluttered  worlds 
(as  compared  with  55%  in  the  static  system)  and 
about  90%  of  the  50%  cluttered  worlds  (as  com¬ 
pared  with  15%  in  the  static  system).  Although  it 
could  be  argued  that  an  alternative  set  of  schema 
parameters  might  lead  to  better  performance  in 
the  static  system,  SINS  would  also  start  out  with 
those  same  settings  and  improve  even  further 
upon  its  initial  performance. 

Our  experiments  revealed  that,  in  both  25%  and 
50%  cluttered  worlds,  SINS  needed  about  40 
worlds  to  learn  enough  to  be  able  to  perform  suc¬ 
cessfully  thereafter  using  10  or  20  cases.  How¬ 
ever,  with  higher  numbers  of  cases  (40  and  80), 
it  took  more  trials  to  learn  the  regularities  in 
the  environment.  It  appears  that  larger  num¬ 
bers  of  cases  require  more  trials  to  train  through 
trial-and-error  reinforcement  learning  methods, 
and furthermore there is no appreciable improvement
in later performance. The case-length parameter
did not have an appreciable effect on
performance  in  the  long  run,  except  on  the  aver¬ 
age  number  of  virtual  collisions  estimator  which 
showed  the  best  results  with  case  lengths  of  4  and 
10. 

As  observed  earlier  in  experiment  set  1,  SINS  re¬ 
quires  a  time  overhead  for  case-based  reasoning 
and  thus  loses  out  on  the  average  time  estimator. 
Due  to  the  nature  of  our  current  case  retrieval 
algorithm,  the  time  required  increases  linearly 
with  max-cases  and  with  case-length.  In  25% 
cluttered  worlds,  values  of  10  and  4,  respec¬ 
tively,  for  these  parameters  provide  comparable 
performance. 

Experiment  set  3:  Effect  of  control  inter¬ 
val.  Although  all  settings  resulted  in  improved 
performance  through  experience,  the  best  and 
worst performance in terms of average number
of worlds solved was obtained with control-interval
set to 12 and 4, respectively. For
low control-interval values, we expect poorer
performance because environment classification
cannot occur reliably. We also expect poorer performance
for very high values because the system
cannot adapt its schema parameters quickly
enough to respond to changes in the environment.
Other performance estimators also show
that control-interval = 12 is a good setting.
Larger control-intervals require fewer case retrievals
and thus improve average time; however,
this is offset by poorer performance
on other estimators.

Experiment set 4: Effect of environmental
change. The results from these experiments
demonstrate the flexibility and adaptiveness of
the learning methods used in SINS. Regardless
of parameter settings, SINS continued to be able
to navigate successfully despite a sudden change
in environmental clutter. It continued to solve
about 95% of the worlds presented to it, with
only modest deterioration in steps, distance, virtual
collisions and time in more cluttered environments.
The performance of the static system,
in contrast, deteriorated in the more cluttered
environment.

Summary: These and other experiments show
the efficacy of the multistrategy adaptation and
learning methods used in SINS across a wide
range of qualitative metrics, such as flexibility
of the system, and quantitative metrics that measure
performance. The results also indicate that
a good configuration for practical applications is
max-cases = 10, case-length = 4, and control-interval
= 12, although other settings might be
chosen to optimize particular performance estimators
of interest. These values have been
determined empirically. Although the empirical
results can be explained intuitively, more theoretical
research is needed to analyze why these
particular values worked best.

4  Conclusions 

We have presented a novel method for augmenting
the performance of a reactive control system
that combines case-based reasoning for on-line
parameter adaptation and reinforcement learning
for on-line case learning and adaptation. The
method is fully implemented in the SINS program,
which has been evaluated through extensive
simulations.

The power of the method derives from its ability
to capture common environmental configurations,
and regularities in the interaction between
the environment and the system, through an on-line,
adaptive process. The method adds considerably
to the performance and flexibility of
the underlying reactive control system because
it allows the system to select and utilize different
behaviors (i.e., different sets of schema
parameter values) as appropriate for the particular
situation at hand. SINS can be characterized
as performing a kind of constructive representational
change in which it constructs higher-level
representations (cases) from low-level sensorimotor
representations (Ram, 1993).

In SINS, the perception-action task and the
adaptation-learning task are integrated in a
tightly knit cycle, similar to the "anytime learning"
approach of Grefenstette & Ramsey (1992).
Perception and action are required so that the
system can explore its environment and detect
regularities; they also, of course, form the basis
of the underlying performance task, that of navigation.
Adaptation and learning are required
to generalize these regularities and provide predictive
suggestions based on prior experience.
Both tasks occur simultaneously, progressively
improving the performance of the system while
allowing it to carry out its performance task without
needing to "stop and think."

In contrast to traditional case-based reasoning
methods which perform high-level reasoning in
discrete, symbolic problem domains, SINS is
based on a new method for "continuous case-based
reasoning" in problem domains that involve
continuous information, such as sensorimotor
information for robot navigation (Ram &
Santamaría, 1993). There are still several unresolved
issues in this research. The case retrieval
process is very expensive and limits the number
of cases that the system can handle without deteriorating
the overall navigational performance,
leading to a kind of utility problem (Minton,
1988). Our current solution to this problem is
to place an upper bound on the number of cases
allowed in the system. A better solution would
be to develop a method for organization of cases
in memory; however, conventional memory organization
schemes used in case-based reasoning
systems (see Kolodner, 1992) assume structured,
nominal information rather than continuous,
time-varying, analog information of the
kind used in our cases.

Another open issue is that of the nature of the regularities
captured in the system's cases. While
SINS' cases do enhance its performance, they
are not easy to interpret. Interpretation is desirable,
not only for the purpose of obtaining
a deeper understanding of the methods, but also
for possible integration of higher-level reasoning
and learning methods into the system.

Despite these limitations, SINS is a complete and
autonomous self-improving navigation system,
which can interact with its environment without
user input and without any pre-programmed
"domain knowledge" other than that implicit in
its reactive control schemas. As it performs its
task, it builds a library of experiences that help
it enhance its performance. Since the system is
always learning, it can cope with major environmental
changes as well as fine-tune its navigation
module in static and specific environment situations.
Abstract

Document  understanding  denotes  the  process  of 
identification  of  logical  components  of  a 
document  and  the  subsequent  extraction  of 
relationships  between  logical  components.  In 
this paper the possibility of learning recognition
rules  for  the  identification  of  logical  components 
in  a  page  layout  is  investigated.  For  this  purpose, 
FOCL,  a  system  that  learns  function-free  Horn 
clauses,  has  been  employed  and  some  problems 
concerning  both  infinite  recursion  and  the 
convergence  of  the  learning  process  have  been 
discussed. Finally, a critique of the underlying
independence assumption made by almost all
systems that learn from examples is presented
and the problem of contextual learning is defined.
Some  preliminary  experimental  results  show 
that  the  definition  of  a  dependence  hierarchy 
between  concepts  can  improve  predictive 
accuracy  and  decrease  learning  time  for  labelling 
problems  like  document  understanding. 

Key  words :  document  understanding,  learning 
dependent  concepts,  contextual  learning 
1.  Introduction

The  automatic  classification  and  understanding 
of  multimedia  documents  are  the  fundamental 
tasks  of  an  intelligent  system  for  office 
automation,  which  aims  at  automatically  storing, 
retrieving  and  interchanging  multimedia  office 
documents. The system is currently being developed
as a task of the workpackage AP (Application
for automatic classification of documents) of the
INTREPID¹ (INnovative Techniques for
REcognition and Processing of Documents)
project. According to the ODA/ODIF standard
(Horak, 1985), any document is characterized
by two different structures representing both its
content and its internal organization: the layout
(or geometric) structure and the logical structure.
The former associates the content of the document
with a hierarchy of layout objects such as text
lines, vertical/horizontal lines, graphic elements,
photographic elements, columns, pages and so
on (Figure 1). The latter associates the content of
the  document  with  a  hierarchy  of  logical  objects 
such  as  title,  abstract,  paragraphs,  sections, 
chapters,  tables,  figures,  and  so  on  (Figure  2). 


¹ The work in the INTREPID project is done within the framework of the ESPRIT programme and partly funded by the
Commission  of  the  European  Communities.  The  following  companies  form  the  consortium:  AEG  Electrocom  (D),  CTA 
(E),  Nottingham  Polytechnic  (GB),  Olivetti  Systems  &  Networks  (I),  Pacer  Systems  Ltd.  (GB),  University  of  Bari  (I), 
University  of  Koblenz  (D)  and  University  of  Naples  (I). 
Each layout/logical object can be described
by a set of attributes. For instance, layout objects
can be characterized by the type of content (text,
graphics, etc.), their position in the page, their
shape, their dimension as well as the numerical
properties of their bitmaps, while logical
objects can be described by their type (abstract,
paragraph, etc.), some key-words contained in
the text (date, figure, etc.), and their position.

Relationships among different objects are
also possible. Of course, the hierarchy in the
layout/logical structures defines some
hierarchical relationships among objects of the
same structure. However, other, and perhaps
more interesting, relationships exist among
logical objects (logical-logical relationships) and
among layout objects (layout-layout
relationships). An example of a layout-layout
relationship is the mutual position of two layout
objects, while the cross-reference of a caption to
a figure or the reading order of some parts of a
document are two examples of logical-logical
relationships. Finally, logical-layout
relationships between one or more elements of
the layout hierarchy and one element of the
logical hierarchy can be defined. These last are
the most interesting, since they allow us to
identify some logical components of a document
without reading its content by means of an
optical character recognizer (OCR) but using
only layout (or geometrical) characteristics. For
instance, in a standard English letter, the date is
under the sender's address, which is in turn in the
top left-hand corner. Thus, this simple layout
information can be profitably exploited by a
document management system to identify
specific portions of content.

Document analysis generally denotes
the process of breaking down a document image
into  several  blocks,  which  represent  layout 
components,  without  any  knowledge  regarding 
the  specific  format  (Tsujimoto  and  Asada,  1990). 

In contrast, the term document
understanding  denotes  the  process  of 
identification  of  logical  components  of  a 
document  and  the  subsequent  extraction  of 
logical-logical  relationships  such  as  the  reading 
order  (Tang  et  al.,  1991).  When  there  exist 
logical-layout  relationships  due  to  a  standard 
format  of  the  document,  then  it  is  possible  to 
understand  a  document  by  using  only  layout 
information  extracted  from  the  layout  analysis 
process.  Furthermore,  the  identification  of  text 
and  picture  regions  is  also  important  in  order  to 
limit  the  application  of  the  OCR,  so  that  only 
information  useful  for  storing  and  retrieval 
purposes  is  read.  Therefore,  document  analysis 
always  precedes  the  document  understanding 
phase.  Nevertheless,  when  several  kinds  of 
documents  have  to  be  automatically  handled, 
document  understanding  becomes  a  difficult 
process  due  to  the  different  logical-layout 
relationships  met  in  each  kind  of  document.  For 
instance,  letters  from  various  companies  will 
present  different  writing  standards,  so  the 
identification  of  the  sender  or  receiver  could  be 
hard  if  only  layout  information  were  used.  Thus, 
an  intermediate  step  becomes  necessary: 
document  classification,  that  is  the  identification 
of  the  particular  class  the  document  belongs  to. 
Once  again,  layout  information  could  help  to 
recognize  the  class  of  a  document  when  there 
exists  a  definite  set  of  relevant  and  invariant 
layout  characteristics,  the  so-called  page  layout 
signature.  In  (Esposito  et  al.,  1990)  a  solution  to 
the  problem  of  document  classification  has  been 
presented. More precisely, the classification rules
for each class of documents can be automatically
generated  by  means  of  an  inductive  learning 
process  given  a  set  of  significant  examples  of 
documents  for  each  class  (training  set).  The 
learning  system,  named  RES,  integrates  a 
parametric  classifier,  in  particular  Fisher’s  linear 
discriminant  functions,  with  a  symbolic  learning 
method  based  on  the  STAR  methodology 
(Michalski,  1980).  The  main  advantage  of 
adopting  a  machine  learning  approach  for  the 
problem  of  document  classification  is  a  greater 
flexibility  of  the  office  documents  management 
system  since  it  can  be  customized  more  quickly 
and  easily.  The  success  of  this  approach  to 
document classification induced us to investigate
the  possibility  of  adopting  the  same  approach  for 
the  problem  of  document  understanding,  that  is 
recognizing  logical  components  of  a  document. 

Given  a  set  of  documents  whose  page  layouts 
have already been analyzed, and assuming that
the user-trainer has already labelled some layout
components  according  to  their  meaning  (e.g., 
sender  or  receiver  of  a  letter),  the  problem  is  that 
of  learning  some  rules  that  allow  the  correct 
labelling  of  layout  components  to  be  performed. 
As  said  above,  the  problem  of  document 
understanding  can  be  strongly  simplified  when 
the  class  of  documents  has  already  been 
identified.  Indeed,  in  this  case  we  can  more 
easily  define  logical  components  for  each  class 
of  documents  and  we  significantly  reduce  the 
variability  of  training  instances  for  this  new 
learning problem. Despite this simplification,
the  problem  of  learning  rules  for  document 
understanding  is  still  more  complex  than  the 
problem  of  learning  recognition  rules  for 
classifying  documents.  In  fact,  in  document 
understanding,  concepts  to  learn  refer  to  a  part  of 
a  document  rather  than  to  the  whole  document, 
and  since  parts  of  documents  may  be  related  to 
each  other  according  to  logical-logical 
relationships,  this  leads  to  the  problem  of  learning 
mutually  dependent  concepts  (or  contextual 
rules). Most of the studies on supervised inductive
learning presented in the machine learning
literature make the implicit assumption that
concepts are independent (independence
assumption), and consequently, that training
instances are independent. Of course, traditional
learning algorithms making the independence
assumption can still be exploited for the problem
of document understanding by simply neglecting
logical-logical relationships, but it is our opinion
that this is not the correct way to solve the problem.
Indeed,  document  understanding  is  a  particular 
case  of  labelling  problems,  in  which  the  correct 
label can often be assigned to a part of a complex
object  only  by  taking  into  account  spatial 
relationships  with  other  parts  whose  labels  are 
already  known.  Thus,  we  believe  that  by  taking 
into  account  concept  dependencies  it  is  possible 
to  generate  more  accurate  and  simpler  rules, 
since  the  learning  paradigm  is  a  better 
approximation  of  reality.  When  concept 
dependencies  are  intrinsically  acyclic,  the 
structure  of  concept  dependencies  can  be 
represented  by  means  of  a  directed  acyclic  graph. 
Such  a  graph  can  be  either  provided  by  the 
teacher or it can be inferred by means of statistical
techniques.  The  former  solution  realizes  a 
combination  of  interactive  concept  learning  and 
supervised  inductive  learning,  while  the  latter 
provides  a  multistrategy  learning  methodology 
that  integrates  numerical  and  symbolic  learning. 
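
As a rough illustration of the acyclic dependency structure described above, the following Python sketch orders concepts so that each one is learned only after the concepts it depends on. The particular dependencies listed (for example, that sender depends on logo) are hypothetical and chosen only for illustration; they are not the hierarchy used in the experiments.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency graph: each concept maps to the set of
# concepts whose labels it is assumed to depend on.
dependencies = {
    "logo": set(),
    "sender": {"logo"},
    "date": {"sender"},
    "receiver": {"sender"},
    "ref": {"date", "receiver"},
}


def learning_order(deps):
    """Return the concepts so that every concept follows its dependencies.

    Raises graphlib.CycleError if the dependencies are not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = learning_order(dependencies)
print(order)
```

If the graph contained a cycle (mutually dependent concepts), `static_order` would raise `CycleError`, which corresponds to the case in which a directed acyclic graph is not an adequate representation.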

In  the  next  section,  a  representation  language 
used  to  describe  a  page  layout  will  be  presented 
and  the  opportunity  of  introducing  some 
intensionally  defined  predicates  in  the 
background knowledge will be discussed. Section
3 is devoted to the problem of learning Horn
clauses by means of a well-known learning
system: FOCL (Pazzani and Kibler, 1992). Some
experimental  results  on  the  problem  of  learning 
rules  for  document  understanding  by  means  of 
the  traditional  strategy  are  presented  in  section 
4,  while  the  problem  of  learning  dependencies 
between  concepts  together  with  results  of  the 
contextual  learning  strategy  for  document 
understanding  are  shown  in  section  5. 


2.  A Representation Language for Page
Layout Description

In the general inductive problem (Muggleton,
1992), we are provided with:

-  $L_O$: the language of observations

-  $L_B$: the language of background knowledge

-  $L_H$: the language of hypotheses

-  a set of examples or observations, $O$, described
by using $L_O$

-  some background knowledge, $B$, described
by means of $L_B$

and we want to find a hypothesis $H$, described in
the language $L_H$, such that:

$B \wedge H \vdash O$

Therefore, before describing how hypotheses,
i.e. rules for document understanding, are
generated, it is necessary to introduce the three
languages, $L_O$, $L_B$, and $L_H$.
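
The requirement that background knowledge and hypothesis together entail the observations can be checked mechanically for function-free Horn clauses. The following Python sketch does so by naive forward chaining over ground facts. The tuple-based clause representation, the block names b1 and b2, and the hypothesis itself are assumptions made for illustration; only the background rule is taken from the text.

```python
def is_var(term):
    # Convention assumed here: variables start with an uppercase letter.
    return term[:1].isupper()


def match(atom, fact, binding):
    """Try to extend `binding` so that `atom` equals the ground `fact`."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    extended = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solutions(body, facts, binding):
    """Yield every binding that satisfies all atoms in `body`."""
    if not body:
        yield binding
        return
    for fact in facts:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from solutions(body[1:], facts, extended)


def closure(facts, rules):
    """Forward-chain until no new ground atom can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Materialize the bindings before mutating `derived`.
            for binding in list(solutions(body, derived, {})):
                atom = (head[0],) + tuple(binding.get(t, t) for t in head[1:])
                if atom not in derived:
                    derived.add(atom)
                    changed = True
    return derived


def entails(facts, rules, observations):
    return set(observations) <= closure(facts, rules)


# B: ground layout facts plus one background rule from the text.
facts = {("position-top-left", "b1"), ("type-text", "b1"),
         ("position-center", "b2"), ("type-text", "b2")}
background = [(("top-horiz-band", "X"), [("position-top-left", "X")])]
# H: a hypothetical hypothesis, not a rule reported in the paper.
hypothesis = [(("logic_type-sender", "X"),
               [("top-horiz-band", "X"), ("type-text", "X")])]
# O: the observation to be explained.
observations = {("logic_type-sender", "b1")}

print(entails(facts, background + hypothesis, observations))
print(entails(facts, background, observations))
```

The first call succeeds because the hypothesis uses the derived literal top-horiz-band(b1); the second fails because the background knowledge alone says nothing about logical labels.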

In the application of document understanding,
$L_O$ is the language used to describe instances of
different logical objects. Each instance is
represented as a ground Horn clause in which
different constants represent different layout
objects of one or more documents as well as the
documents themselves. In particular, a subset of
the Horn clauses is used, namely linked Horn
clauses (Helft, 1987), since it allows only
meaningful hypotheses to be represented. The
set of extensionally defined predicates is reported
in Table I. A predicate is extensionally defined
when a list of tuples for which the predicate is
true is provided. Figure 3 shows an example of
page layout in which some blocks have been
labelled. There are five different logical objects,
namely sender, receiver, logotype, reference
number and date. Other blocks are purposely
unlabelled since we are not interested in recognizing
each part of the document. Obviously, each
document is a source of more than one instance
of logical components. Therefore, we can write
down as many ground Horn clauses as the number
of layout objects in a page layout. In Figure 4 the
description of the block sender of the document
is provided.


Table I
Predicates for the Page Layout Description

Predicate                        Meaning
-------------------------------  ------------------------------------
logic_type-sender(X)             logical label of the layout object X
logic_type-receiver(X)
logic_type-logo(X)
logic_type-ref(X)
logic_type-date(X)
logic_type-unsigned(X)

width-very-very-small(X)         width of the layout object X
width-very-small(X)
width-small(X)
width-medium-small(X)
width-medium(X)
width-medium-large(X)
width-large(X)
width-very-large(X)
width-very-very-large(X)

height-smallest(X)               height of the layout object X
height-very-very-small(X)
height-very-small(X)
height-small(X)
height-medium-small(X)
height-medium(X)
height-medium-large(X)
height-large(X)
height-very-large(X)
height-very-very-large(X)
height-largest(X)

type-text(X)                     type of the layout object X
type-hor-line(X)
type-picture(X)
type-ver-line(X)
type-graphic(X)
type-mixture(X)

part_of(X, Y)                    layout object Y belongs to document X

position-top-left(X)             position of the layout object X
position-top(X)
position-top-right(X)
position-left(X)
position-center(X)
position-right(X)
position-bottom-left(X)
position-bottom(X)
position-bottom-right(X)

on_top(X, Y)                     layout object X is on top of layout object Y

to_right(X, Y)                   layout object X is to the right of layout object Y

aligned-only-left-col(X, Y)      layout objects X and Y are aligned
aligned-only-right-col(X, Y)
aligned-only-middle-col(X, Y)
aligned-both-columns(X, Y)
aligned-only-upper-row(X, Y)
aligned-only-lower-row(X, Y)
aligned-only-middle-row(X, Y)
aligned-both-rows(X, Y)

The language of hypotheses, $L_H$, generated
by FOCL, is a subset of pure Prolog, in which
neither functions nor constants occur: Horn
clauses of $L_H$ are called function-free. Like FOIL
(Quinlan, 1990), FOCL adopts Prolog's negation-as-failure
rule (Clark, 1978) to define the meaning
of a negated predicate. A hypothesis is expressed
as a collection of function-free Horn clauses
having the same head. Such a collection is called a
rule or predicate definition. FOCL allows
predicates to be defined intensionally as well;
that is, it provides a way to introduce some
inference rules as background knowledge to use
during the induction process.

For  the  application  of  document 
understanding,  we  defined  several  inference  rules 
concerning  the  position  of  a  block,  the  type  of 
alignment  between  blocks  and  the  mutual 
position  of  blocks.  As  to  the  position  of  blocks, 
a page was originally split into nine areas by
discretizing  the  numerical  coordinates  of  the 
centre  of  each  block.  However,  some  logical 
components  may  be  in  different  positions  but  in 
the  same  band  (a  band  is  a  set  of  three  contiguous 
positions  in  the  page).  In  this  case,  information 
on  the  band  may  be  more  useful  than  the  detailed 
information  on  the  area,  since  it  makes  the 
generation  of  a  rule  easier.  Therefore,  we 
introduced  the  following  inference  rules  as 
background  knowledge: 
top-horiz-band(X) <- position-top-left(X)
top-horiz-band(X) <- position-top(X)
top-horiz-band(X) <- position-top-right(X)
central-horiz-band(X) <- position-left(X)
central-horiz-band(X) <- position-center(X)
central-horiz-band(X) <- position-right(X)
bottom-horiz-band(X) <- position-bottom-left(X)
bottom-horiz-band(X) <- position-bottom(X)
bottom-horiz-band(X) <- position-bottom-right(X)
left-vert-band(X) <- position-top-left(X)
left-vert-band(X) <- position-left(X)
left-vert-band(X) <- position-bottom-left(X)
central-vert-band(X) <- position-top(X)
central-vert-band(X) <- position-center(X)
central-vert-band(X) <- position-bottom(X)
right-vert-band(X) <- position-top-right(X)
right-vert-band(X) <- position-right(X)
right-vert-band(X) <- position-bottom-right(X)
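
The band rules amount to reading off the row and the column of a block in the 3x3 grid of page areas. A minimal Python sketch of that reading follows; the grid layout and the function name `bands` are illustrative choices, not part of FOCL.

```python
# Rows and columns of the nine page areas, in the order used by the rules.
HORIZ_BANDS = ("top", "central", "bottom")
VERT_BANDS = ("left", "central", "right")
GRID = (
    ("top-left", "top", "top-right"),
    ("left", "center", "right"),
    ("bottom-left", "bottom", "bottom-right"),
)


def bands(position):
    """Return the (horizontal band, vertical band) pair for a page area."""
    for row_index, row in enumerate(GRID):
        if position in row:
            col_index = row.index(position)
            return (f"{HORIZ_BANDS[row_index]}-horiz-band",
                    f"{VERT_BANDS[col_index]}-vert-band")
    raise ValueError(f"unknown position: {position!r}")


print(bands("top-left"))
print(bands("bottom"))
```

Two blocks in different areas, such as top-left and top-right, share the band top-horiz-band, which is the generalization that the background rules make available to the learner.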
Figure 3. Page layout of a document with
labelled  blocks. 

We also defined the predicates aligned-by-column
and aligned-by-row as follows:
aligned-by-column(X,Y) <- aligned-only-left-col(X,Y)
aligned-by-column(X,Y) <- aligned-only-middle-col(X,Y)
aligned-by-column(X,Y) <- aligned-only-right-col(X,Y)
aligned-by-column(X,Y) <- aligned-both-columns(X,Y)
aligned-by-row(X,Y) <- aligned-only-upper-row(X,Y)
aligned-by-row(X,Y) <- aligned-only-middle-row(X,Y)
aligned-by-row(X,Y) <- aligned-only-lower-row(X,Y)
aligned-by-row(X,Y) <- aligned-both-rows(X,Y)
Moreover, it is possible to define the following
predicates:

aligned-left-col(X,Y) <- aligned-only-left-col(X,Y)
aligned-left-col(X,Y) <- aligned-both-columns(X,Y)
aligned-middle-col(X,Y) <- aligned-only-middle-col(X,Y)
aligned-middle-col(X,Y) <- aligned-both-columns(X,Y)
aligned-right-col(X,Y) <- aligned-only-right-col(X,Y)
aligned-right-col(X,Y) <- aligned-both-columns(X,Y)
aligned-middle-row(X,Y) <- aligned-only-middle-row(X,Y)
aligned-middle-row(X,Y) <- aligned-both-rows(X,Y)
aligned-upper-row(X,Y) <- aligned-only-upper-row(X,Y)
aligned-upper-row(X,Y) <- aligned-both-rows(X,Y)
aligned-lower-row(X,Y) <- aligned-only-lower-row(X,Y)
aligned-lower-row(X,Y) <- aligned-both-rows(X,Y)

logic_type-sender(x2) <-
logic_type-receiver(x3), logic_type-unsigned(x4),
logic_type-logo(x5), logic_type-date(x6), logic_type-ref(x7),
logic_type-unsigned(x8), logic_type-unsigned(x9),
logic_type-unsigned(x10), logic_type-unsigned(x11),
part_of(x1,x2), part_of(x1,x3), part_of(x1,x4), part_of(x1,x5),
part_of(x1,x6), part_of(x1,x7), part_of(x1,x8), part_of(x1,x9),
part_of(x1,x10), part_of(x1,x11),
width-medium(x2), width-medium-large(x3),
width-smallest(x4), width-medium(x5),
width-medium-small(x6), width-medium-large(x7),
width-very-very-large(x8), width-very-very-large(x9),
width-medium-large(x10), width-smallest(x11),
height-medium-large(x2), height-small(x3),
height-smallest(x4), height-very-small(x5),
height-very-very-small(x6), height-very-very-small(x7),
height-smallest(x8), height-large(x9),
height-medium-small(x10), height-smallest(x11),
type-text(x2), type-text(x3), type-text(x4),
type-picture(x5), type-text(x6), type-text(x7),
type-text(x8), type-text(x9), type-mixture(x10),
type-text(x11), position-top-left(x2), position-top(x3),
position-top-left(x4), position-top-left(x5),
position-top-right(x6), position-top(x7),
position-center(x8), position-center(x9),
position-bottom-right(x10), position-bottom-left(x11),
on_top(x5,x8), on_top(x6,x8), on_top(x7,x8), on_top(x9,x10),
to_right(x2,x4), to_right(x5,x7),
aligned-both-columns(x2,x5), aligned-only-lower-row(x5,x7),
aligned-only-left-col(x4,x7), aligned-both-rows(x7,x6),
aligned-only-right-col(x8,x9), aligned-only-upper-row(x4,x3),
aligned-only-left-col(x8,x11)

Figure 4. Ground Horn clauses for the sender
of the layout in Figure 3.

It is worth noting that aligned-by-column(X,Y)
means that X and Y are aligned by
column and X is above Y. However, this does
not imply that X is on_top of Y, since the literal
on_top(X,Y) states that X is above Y and their
distance is less than 50 points on the vertical
axis. Thus, it makes sense to define the predicate
above as follows:
above(X,Y) <- on_top(X,Y)
above(X,Y) <- aligned-by-column(X,Y)

Analogously, we defined the predicate
to_the_right_side as follows:
to_the_right_side(X,Y) <- to_right(X,Y)
to_the_right_side(X,Y) <- aligned-by-row(X,Y)
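
The difference between on_top and above can be made concrete with a small Python sketch. It assumes that each block is described by the vertical coordinates of its upper and lower edges, measured in points with y growing downward, and that on_top depends only on the vertical gap; both are assumptions for illustration, since the text states only the 50-point threshold. The predicate to_the_right_side could be sketched in the same way along the horizontal axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    top: float     # y of the upper edge in points (y grows downward: assumption)
    bottom: float  # y of the lower edge in points


MAX_GAP = 50  # vertical distance threshold stated in the text


def on_top(x, y):
    """X is above Y and the vertical gap between them is under 50 points."""
    gap = y.top - x.bottom
    return 0 <= gap < MAX_GAP


def above(x, y, aligned_by_column):
    """above(X,Y) <- on_top(X,Y);  above(X,Y) <- aligned-by-column(X,Y)."""
    return on_top(x, y) or (x.name, y.name) in aligned_by_column


a = Block("a", top=0, bottom=40)
b = Block("b", top=60, bottom=100)
c = Block("c", top=300, bottom=340)
aligned_by_column = {("a", "c")}  # hypothetical extensional facts

print(on_top(a, b), on_top(a, c), above(a, c, aligned_by_column))
```

Block a is above block c only through column alignment, because their vertical gap exceeds the threshold; this is the case that on_top alone does not cover.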

Horn clauses allow recursion to be represented
as well. Recursion can be quite useful for the
application of document understanding. For
instance, let us consider the portion of layout
shown in Figure 5. For some reason, the logical
component sender has been
fragmented into several layout blocks, but
fragment 4 can be easily recognized since it is
above a block of type picture. If recursive
definitions are used, then we can easily label
blocks 1, 2 and 4 by means of the following rule:
sender(X) <- above(X,Y), type-picture(Y)
sender(X) <- on_top(X,Y), sender(Y).
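
The effect of the recursive sender rule can be sketched in Python as a bottom-up fixed-point computation. The block numbering follows the description of Figure 5 only loosely: block 5 as the picture and block 3 as an unrelated block are assumptions, as are the extensional facts. Evaluating the rule bottom-up over a finite set of blocks always terminates, because the set of labelled blocks can only grow.

```python
def label_senders(above, on_top, pictures):
    """Apply the two sender clauses until no new block is labelled.

    above, on_top: sets of (x, y) pairs; pictures: set of picture blocks.
    """
    # Base clause: sender(X) <- above(X,Y), type-picture(Y)
    senders = {x for (x, y) in above if y in pictures}
    changed = True
    while changed:
        changed = False
        # Recursive clause: sender(X) <- on_top(X,Y), sender(Y)
        for x, y in on_top:
            if y in senders and x not in senders:
                senders.add(x)
                changed = True
    return senders


# Hypothetical facts: fragments 1, 2, 4 stacked over picture block 5;
# block 3 is an unrelated block elsewhere on the page.
on_top = {(1, 2), (2, 4), (4, 5)}
above = set(on_top)  # every on_top pair is also an above pair
pictures = {5}

print(sorted(label_senders(above, on_top, pictures)))
```

Fragment 4 is labelled by the base clause, and fragments 2 and 1 follow in successive passes of the recursive clause.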

Any other rule for the recognition of blocks
sender would be more complex, and, in our
opinion, sometimes less accurate than that given
above. However, recursion should be used with
caution: it is necessary to check the existence of
a termination condition in all cases in order to
avoid infinite recursion. For instance, the
recursive rule:

logic_type-ref(X) <- to_the_right_side(X,Y),
                     to_the_right_side(Z,Y),
                     logic_type-ref(Z),
                     aligned-upper-row(X,W)
logic_type-ref(X) <- width-small(X),
                     central-vert-band(X)

can cause infinite recursion since X and Z can be
bound to the same layout block.

There are several types of recursion. In direct
recursion, like the previous example, the same
literal appears both in the head and in the body of
a clause. In indirect (or mutual) recursion, a
predicate p(x) in the body of a clause whose head
is q(x) appears in the head of another clause
whose body contains q(x):

q(x) <- p(x), r(x)
p(x) <- q(x), s(x).

Figure 5. The problem of fragmentation in page
layout.

In tail recursion, the computation of a function
has already been completed when the axiomatic
level is reached. An example of tail recursion is
given by the following definition of the Prolog
predicate reverse:
reverse(X,Y) <- reverse1(X,[],Y)
reverse1([],X,X)
reverse1([A|X],Y,Z) <- reverse1(X,[A|Y],Z)
which is more efficient than the classical direct-recursive
definition:
reverse([],[])
reverse([A|X],Y) <- reverse(X,Z), append(Z,[A],Y).

In fact, tail recursion generally uses memory
more efficiently than direct recursion, and
therefore it would be better to learn, when
possible, tail-recursive definitions. However,
this means that new predicates, which were not
present in the original definition, have to be
generated (e.g., reverse1). This further
complicates the problem; a possible solution lies in
constructive induction.
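
The two definitions of reverse can be mirrored in Python to show where the auxiliary predicate comes from. This is only a structural sketch: Python does not eliminate tail calls, so the memory advantage discussed above applies to Prolog systems that do, not to this code. The function names are chosen to match the predicates.

```python
def reverse_naive(items):
    """reverse([A|X],Y) <- reverse(X,Z), append(Z,[A],Y)."""
    if not items:
        return []
    # Work remains after the recursive call returns: the append.
    return reverse_naive(items[1:]) + [items[0]]


def reverse1(items, accumulator):
    """reverse1([A|X],Y,Z) <- reverse1(X,[A|Y],Z)."""
    if not items:
        # The result is already complete when the base case is reached.
        return accumulator
    return reverse1(items[1:], [items[0]] + accumulator)


def reverse(items):
    """reverse(X,Y) <- reverse1(X,[],Y)."""
    return reverse1(items, [])


print(reverse_naive([1, 2, 3]), reverse([1, 2, 3]))
```

The accumulator argument of `reverse1` has no counterpart in the naive definition, which is why learning the tail-recursive form requires inventing a new predicate.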

For the application of document
understanding, direct recursion is the most useful
form of recursion we need, even if in learning
dependent concepts indirect recursion is also
desirable. We use an option of FOCL that
implements a very simple technique to prevent
infinitely recursive clauses. In particular, when
a recursive literal p(Y) is added to the body of a
clause whose head is p(X), the literal not(X=Y)
is added as well. This technique is better than
that implemented in mFOIL (Dzeroski and
Bratko, 1992), but less sophisticated than that
reported in (Quinlan, 1990) for FOIL.

3.  An  Algorithm  That  Learns  Horn 
Clauses:  FOCL 

FOCL (Pazzani and Kibler, 1992) is an extension of FOIL
(Quinlan, 1990) in several respects. Like FOIL, it implements a
separate-and-conquer strategy to learn a rule. Given a set of
positive and negative instances of a concept p(X1, X2, ..., Xn) to
learn, FOCL starts with the initial hypothesis:

p(X1, X2, ..., Xn) <-

whose body is empty and repeatedly adds a new literal
q(Y1, Y2, ..., Yk) until the clause covers only positive instances.
Given a clause:

p(X1, X2, ..., Xn) <- φ(Z1, Z2, ..., Zm)

where φ(Z1, Z2, ..., Zm) denotes a conjunction of literals, a tuple
is a value assignment for the variables X1, X2, ..., Xn, Z1, Z2,
..., Zm such that the clause is satisfied. In particular, the tuple
is positive/negative if the value assignment for the variables in
the head coincides with a positive/negative instance of the
predicate p(X1, X2, ..., Xn). The addition of a literal
q(Y1, Y2, ..., Yk) to the body of an inconsistent clause may change
the set of tuples covered by the clause and, consequently, the
proportion of positive and negative tuples in that set. Among all
the literals that can be added to a clause, the one maximizing an
information-theoretic heuristic, called information gain, is
selected. A positive information gain means that the proportion of
positive tuples with respect to the set of covered tuples is
increased by adding a given literal to the clause. The information
gain can be computed for predicates defined both extensionally and
intensionally. When a partial, possibly incorrect, intensional
definition of the concept to learn is provided, FOCL uses the
information gain metric in order to operationalize the concept
description as in explanation-based learning. However, in our
application to document understanding this potential is not
exploited.
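As an illustration of how such a heuristic ranks candidate literals, the sketch below uses the gain formula usually given for FOIL (Quinlan, 1990). Whether FOCL computes exactly this quantity is an assumption here, and the function names and the tuple counts are hypothetical.

```python
import math


def foil_gain(pos_before, neg_before, pos_after, neg_after, pos_kept):
    """FOIL-style gain of adding one literal to a clause.

    pos_before/neg_before: positive/negative tuples covered before the addition.
    pos_after/neg_after: positive/negative tuples covered after the addition.
    pos_kept: positive tuples of the old clause still covered afterwards.
    """
    if pos_after == 0:
        return float("-inf")  # a literal that covers no positive tuple is useless
    info_before = -math.log2(pos_before / (pos_before + neg_before))
    info_after = -math.log2(pos_after / (pos_after + neg_after))
    return pos_kept * (info_before - info_after)


def best_literal(candidates, pos_before, neg_before):
    """Greedy choice: the candidate literal with the highest gain."""
    return max(
        candidates,
        key=lambda name: foil_gain(pos_before, neg_before, *candidates[name]),
    )


# Made-up counts: name -> (pos_after, neg_after, pos_kept)
candidates = {"q1": (8, 2, 8), "q2": (5, 5, 5)}
chosen = best_literal(candidates, pos_before=10, neg_before=10)
```

With these made-up counts the gain of q2 is zero, because the proportion of positive tuples is unchanged, while q1 raises that proportion and is therefore preferred.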

Another characteristic that distinguishes FOCL from FOIL is the
availability of relational cliches that suggest potentially useful
combinations of predicates to test while generating a clause of a
predicate definition (Silverstein and Pazzani, 1991). In this way,
cliches provide a form of look-ahead that tries to overcome the
problem of the horizon effect, which leads a hill-climbing search
strategy to find local rather than global maxima. For instance, the
following clause:

logic_type-sender(X) <- above(X,Y), type-picture(Y)
that allows several logical objects of type sender to be
recognized, cannot be generated without the introduction of
cliches, since the predicate above(X,Y) has a very small positive
information gain and FOCL prefers other literals to it. In this
case, it is the type of the block below a sender that discriminates
a sender from any other kind of logical object; unfortunately,
discarding the literal above(X,Y) prevents the learning system from
discovering this discriminant information. By introducing a
relational cliche in which FOCL is told to test the pairs of
literals:

above(X,Y), q(Y1, Y2, ..., Yk)

this inconvenience is overcome, but the search space to explore
becomes wider.

4.  Experimental  Results 

In  section  2,  a  set  of  inference  rules  that  form  the 
background  knowledge  has  been  defined. 
However,  the  utility  of  these  rules  has  to  be 
evaluated  empirically.  For  this  reason,  we 
considered  a  set  of  30  single  page  documents, 
namely  copies  of  letters  sent  by  Olivetti.  For 
each  experiment,  this  set  was  randomly  split  into 
two  subsets  according  to  the  following  criterion: 
-  20  documents  for  the  training  set 
- 10  documents  for  the  test  set 

There are five concepts to learn, namely sender of the letter,
receiver, logotype, reference number and date (other concepts, such
as body of the letter or signature, will be considered in future
experiments). Obviously, not all blocks are instances of one of
these concepts, that is, there are some unlabelled blocks that we
are not interested in classifying. Moreover, there might be more
than one block with the same label in a document, since some
logical components might have been fragmented into several layout
blocks that the layout analysis was not able to group together.

In the first experimentation no background knowledge was used
during the generalization process. Six different experiments were
organized by randomly selecting the documents for the training and
test sets. Results for each experiment are reported in Table II,
where the entries n/m of each experiment report the number of
commission (n) and omission (m) errors for each rule. The last
column of the table reports the average of the error rates for each
experiment, computed as the sum of commission and omission errors
divided by the number of logical components in the test set for a
given class. The TOTAL average error is calculated as the average
of the TOTAL errors for each experiment, divided by the number of
logical components (including unlabelled blocks). The entry
"tot. cost" for each experiment concerns the number of tested
literals, namely those involving intensionally and extensionally
defined predicates, as well as the pairs of literals tested with
cliches. Since in this first experimentation neither background
knowledge nor relational cliches were used, the reported numbers
refer to the number of literals involving only extensionally
defined predicates.
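The per-class error rate described above can be written down directly. The following sketch only restates that definition; the function names are hypothetical and the counts are made up for illustration, they are not taken from the tables.

```python
def error_rate(commission, omission, n_components):
    """(commission + omission) errors over the logical components of a class."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return (commission + omission) / n_components


def average_error_rate(runs):
    """Average of the per-experiment error rates for one class.

    runs: list of (commission, omission, n_components), one per experiment.
    """
    rates = [error_rate(c, o, n) for c, o, n in runs]
    return sum(rates) / len(rates)


# Made-up n/m entries for three experiments on one class
example_runs = [(1, 0, 10), (0, 2, 10), (0, 0, 10)]
example_average = average_error_rate(example_runs)
```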

Table III summarizes results concerning the second
experimentation, in which the background knowledge is used. The
structure of the table is the same as that of Table II, but in this
case the total cost includes the number of literals involving
intensionally defined predicates as well. It is possible to note
that the introduction of the background knowledge did not
significantly improve the predictive accuracy of the final rules,
since the difference is only 0.2% better than the basic case, while
the number of tested literals varies from 36% to 300% more than
that for the basic case.

Table II

Results of the first experimentation:
basic case

In the third experimentation two cliches were introduced, namely
the ON_TOP cliche for the pairs of literals

on_top(X,Y), q(Y1, Y2, ..., Yk)

in which q is a predicate defined either extensionally or
intensionally and at least one of the variables Yi is in the set
{X,Y}, and the TO_RIGHT cliche that tested all pairs of literals

to_right(X,Y), q(Y1, Y2, ..., Yk).

By looking at the results reported in Table IV, it is possible to
observe that the introduction of cliches did not significantly
improve the accuracy with respect to the basic case (no background
knowledge and no cliches). Strangely, it seems that enlarging the
search space helps to learn better rules for some classes, such as
ref, but leads the system to consider wrong hypotheses for other
concepts, such as date.

In another experimentation (see Table V) both the background
knowledge and four cliches were introduced. In particular, in
addition to ON_TOP and TO_RIGHT, two other cliches were introduced,
namely ABOVE and TO_THE_RIGHT_SIDE, which allow for testing pairs of
literals in which the first literal contains the intensionally
defined predicates above and to_the_right_side, respectively. In
this experimentation the global results worsen (average error rate
= 8.5) and for one training set FOCL was not able to generate a
complete and consistent rule for the concept date after almost 4
hours of CPU time on a SUN station 4/25. Indeed, FOCL preferred to
introduce predicates with new variables (more than 10), thus
enormously widening the search space at each step without really
improving consistency. The existence of a simpler and consistent
rule is guaranteed by the fact that in the previous experiments
with only 2 cliches FOCL always converged towards a solution. A way
to force FOCL to converge towards a simpler solution than the one
it is looking for is to define a limit on the maximum number of new
variables that can be introduced in a rule.

Table III

Results of the second experimentation:
basic case + BK

Table IV

Results of the third experimentation:
basic case + two cliches

Table V

Results of the fourth experiment:
basic case + 4 cliches + BK

However, such a limit is simply a trick to bypass the problem of
divergence; the true problem lies in the information gain function
that guides the hill-climbing search. Indeed, this heuristic seems
biased towards the introduction of new variables rather than
towards the discriminatory power of a literal. For instance, in one
run we observed that the system preferred the pair of literals:

to_the_right_side(X, Y), aligned-upper-row(Z, Y)
covering 34/36 positive tuples and 38/79 negative tuples (gain
67.0), rather than the literals:

to_right(X, Y), logic_type-logo(Y)
that covered 21/25 positive tuples and 1/42 negative tuples (gain
62.7). A similar problem with the gain function used in ID3 for the
selection of the next attribute to test has also been noticed by
Fayyad in the induction of decision trees for multiple concept
learning (Fayyad, 1991).
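The gap in discriminatory power between the two pairs of literals can be made explicit by comparing the fraction of positive tuples among the tuples each pair covers. The sketch below assumes that "34/36" means 34 covered tuples out of 36, and likewise for the other counts; it does not attempt to reproduce the reported gain values, whose exact formula is not given here.

```python
def positive_fraction(pos_covered, neg_covered):
    """Share of positive tuples among all tuples covered by a clause."""
    return pos_covered / (pos_covered + neg_covered)


# Pair preferred by the system: 34 positive and 38 negative tuples covered
preferred = positive_fraction(34, 38)

# Pair that was not chosen: 21 positive tuples and 1 negative tuple covered
rejected = positive_fraction(21, 1)
```

Under this reading the preferred pair leaves fewer than half of the covered tuples positive, while the rejected pair leaves almost only positive tuples, which is consistent with the bias discussed above.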

Since in section 2 the generation of recursive rules has been
claimed useful for the problem of document understanding, we also
tried to introduce recursion in the learning process. Table VI
summarizes the experimental results, which would probably be better
than those shown in Table V if the concept date had not led to
problems of non-convergence of the learning process and infinite
recursion. In fact, in one case FOCL did not generate any rule
after 4 hours of CPU time, while in another it generated the
following rule:
logic_type-date(X) <- to_the_right_side(Y, X),
                      to_the_right_side(Z, Y),
                      above(X, W), height-smallest(W),
                      ¬to_right(Y, X).
logic_type-date(X) <- to_the_right_side(Y, X),
                      ¬(X = Y), logic_type-date(Y).
logic_type-date(X) <- to_the_right_side(Y, X),
                      to_the_right_side(Z, Y),
                      to_the_right_side(Z, W),
                      central-vert-band(W),
                      above(X, V),
                      aligned-by-column(V, U).
logic_type-date(X) <- to_the_right_side(Y, X),
                      aligned-lower-row(Z, Y),
                      ¬to_right(Y, X),
                      width-medium-small(X).
logic_type-date(X) <- to_right(X, Y), ¬(X = Y),
                      logic_type-date(Y),
                      width-small(X).

which causes problems of infinite recursion for some block, say B1,
which is not to_the_right_side of another block but is itself
to_right of another block, say B2. Indeed, in this case the first
four clauses fail to prove the goal logic_type-date(B1), while the
last clause is satisfied if the goal logic_type-date(B2) is proven.
However, if the first clause cannot explain this subgoal, the
classifier finds the second clause, in which the first literal is
certainly true, and then tries to prove again the goal
logic_type-date(B1), thus falling into infinite recursion.

Table VI

Results of the fifth experimentation:
basic case + 4 cliches + BK + recursion
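The loop between blocks B1 and B2 can be reproduced with a small backward-chaining sketch. It is a simplification made under stated assumptions: only the two recursive clauses are modelled, the non-recursive clauses are assumed to fail for both blocks, and the facts to_right(B1, B2) and to_the_right_side(B1, B2) are hypothetical. Instead of looping forever, the sketch raises an exception when a goal reappears among its own pending ancestors.

```python
class LoopDetected(Exception):
    """Raised when a goal is called again while it is still being proved."""


def prove_date(x, facts, pending=()):
    """Try to prove logic_type-date(x) using only the two recursive clauses."""
    if x in pending:
        raise LoopDetected(" -> ".join(pending + (x,)))
    pending = pending + (x,)
    # date(X) <- to_the_right_side(Y, X), not(X = Y), date(Y)
    for y, target in sorted(facts["to_the_right_side"]):
        if target == x and y != x and prove_date(y, facts, pending):
            return True
    # date(X) <- to_right(X, Y), not(X = Y), date(Y), width-small(X)
    for source, y in sorted(facts["to_right"]):
        if source == x and y != x and prove_date(y, facts, pending):
            if x in facts["width_small"]:
                return True
    return False


# Hypothetical facts matching the situation described for B1 and B2
facts = {
    "to_right": {("B1", "B2")},
    "to_the_right_side": {("B1", "B2")},
    "width_small": {"B1"},
}
```

Calling prove_date("B1", facts) raises LoopDetected, which shows that the guard not(X = Y) rules out only immediate self-reference and not a cycle through a second block.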

5.  Contextual  Learning 

5.1  The  problem 

Inductive learning is undoubtedly the paradigm that has been most
widely investigated in machine learning. In particular, several
studies have been made on learning from examples and many learning
systems have been developed. They differ in several aspects, such
as the representation language used to represent examples or
observations as well as the background knowledge and the
hypotheses, the search strategy adopted to search the space of
hypotheses defined by the representation language, and the amount
and the type of background knowledge exploited during the learning
process.

However, there is an aspect common to most of the studies on
learning from examples: the basic assumption that concepts are
mutually independent. Even though in many applications such an
assumption is reasonable, this is not generally true. For instance,
if our aim is that of recognizing flowers and trees in a picture,
we can try to learn the concepts independently. This means that we
provide a learning system with instances of trees and flowers and
we try to find those properties that characterize them. However, in
this case we are deliberately neglecting all other properties that
relate the two concepts, such as their relative height (trees are
taller than flowers), that can make the recognition task easier in
most cases. Astute readers will surely note that relationships
between the concepts of trees and flowers do not characterize the
two concepts, but express a constraint between them. In fact, when
the independence assumption is not made, the learning problem
becomes that of learning properties that characterize each concept
(or discriminate it from other concepts) as well as dependencies
(or constraints) between concepts.

A natural consequence of concept dependency is that instances of
dependent concepts are dependent themselves. This gives us an
immediate way to recognize those learning systems in which the
independence assumption is made: they allow for representing only
independent instances. For example, all relational databases used
by the machine learning community to test different learning
systems represent instances of concepts as (n+1)-tuples:

$\langle a_1, a_2, \ldots, a_n, c \rangle$

where $a_i$ are attributes of the concept and $c$ is the membership
class (i.e. the name of the concept). In this case concept
dependencies can only be expressed by allowing the domain of some
attribute $a_i$ to contain values of the domain of the class
attribute $c$, but this is never done. Sometimes, the assumption
that classes are mutually exclusive is made explicit (Quinlan,
1986), while the problem of instances that belong to different
classes (e.g. patients that show symptoms common to two different
diseases) is only rarely considered at all (Michalski, 1983).

Another clear example of the independence assumption can be found
in the parametric methods studied in statistical pattern
recognition. Indeed, the formulation of these methods begins with
the statement that each class is described by a class distribution
function, $p(x_i \mid x_i \in C_j, \theta_j)$, which gives the
probability of a datum $x_i$ if it were known to belong to class
$C_j$ ($\theta_j$ is simply a class parameter vector that
characterizes the class distribution in a parametric family of
distributions) (Hand, 1981). Trivially, in the case of diseases
that show common symptoms, class distribution functions are no
longer adequate and we need to consider the joint probability
function $p(x_i \mid x_i \in C_1, x_i \in C_2, \ldots, x_i \in C_k, \theta)$.

It should be observed that the fact that instances are considered
independent by all the methods of learning from examples does not
mean that concepts are really independent. Indeed, independence is
only an assumption, which may or may not be adequate for the
problem at hand. This means that it is also possible to learn
dependencies between concepts even though instances show no
explicit form of such dependencies. A typical example of learning
dependencies between concepts is met in the area of statistical
causal inference (Esposito et al., 1993a).

When concept dependencies are intrinsically acyclic, we can
represent them by dependence hierarchies, that is, directed acyclic
graphs whose nodes are the concepts to learn (see Figure 6). The
order in which concepts should be learned is completely defined by
the dependence hierarchy: in particular, the concepts in the lowest
level of a dependence hierarchy have to be learned first, since
their definition does not depend on other concepts. Some authors
name such concepts golden points (Baroglio and Giordana, 1992), but
we prefer the term minimally dependent concepts. In some studies on
inductive learning, the necessity to learn dependent concepts has
been implicitly recognized. For instance, many studies on inductive
logic programming have reported Shapiro's problem of learning two
Prolog predicates, append and reverse (Shapiro, 1981). Obviously,
in all these studies the predicate reverse was learned only after
the predicate append had been learned. For instance, FOIL can learn
the predicate append in 57.9 secs and then requires 29.0 secs for
learning the predicate reverse on a SUN station 4/25. Not
surprisingly, if the concepts are learned independently, the rule
produced for the predicate reverse is no longer correct and the
time required for learning is about 78.2 secs. This is another
proof, if necessary, of the problems with accuracy of hypotheses
and efficiency of the learning process that the independence
assumption can cause.

Figure 6. An example of dependence hierarchy between concepts
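The learning order induced by an acyclic dependence hierarchy is a topological order of the graph. A minimal sketch, assuming Python 3.9 or later and using the standard graphlib module, applied to the append/reverse example; the dictionary layout (concept mapped to the concepts it depends on) is a choice made only for this illustration.

```python
from graphlib import TopologicalSorter

# Each concept is mapped to the set of concepts its definition depends on.
depends_on = {
    "append": set(),        # minimally dependent concept
    "reverse": {"append"},  # reverse is defined in terms of append
}

# static_order() yields every concept after all the concepts it depends on.
learning_order = list(TopologicalSorter(depends_on).static_order())
```

If the dependencies contained a cycle, static_order() would raise graphlib.CycleError, which corresponds to the case in which no dependence hierarchy exists.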

5.2  Contextual  problems  in  structural 
domains 

The problem of contextual learning has its natural setting in
structural domains, that is, those domains in which concepts,
observations and background knowledge can be effectively described
by means of first-order logic rather than propositional calculus.
The need to move to a more powerful representation language arises
because observations can be made up of several parts, so that the
representation of one such observation consists of a number of
attributes concerning each subpart as well as of different
relationships between subparts.

Relationships can be easily represented as first-order logic
predicates, but can hardly be expressed by means of propositional
calculus (Quinlan, 1990). There are several examples of structural
domains presented in the machine learning literature. Winston's
arch problem (1970) and Michalski's problem of trains going east or
west (1980) are just some of them. A common aspect of all these
problems is that we want to predict a property, the class, that
concerns the objects as a whole. However, there are a number of
problems in which rules that allow some subparts of an object to be
classified are sought. Many such problems can be easily found in
the literature on computer vision. Here are two of them:

-  Scene labelling problem: Given a picture
taken from a scene, say an office, it is required
to identify some objects in the scene. A low-level
computer vision system preprocesses the
picture, segments it and generates numerical
or symbolic features for each segment as well
as a description of how segments are related
(for instance, a segment is included in
another, or is adjacent to another). At this
stage, it is necessary to associate each or
some segments with the name of the object in
the scene they represent. The problem is
further complicated by the fact that the
picture of an object can be fragmented into
several segments.

-  Edge  orientation  problem:  Given  a  picture  of 
one  or  more  objects,  in  which  the  edges  of  the 
objects  have  already  been  detected,  it  is 
required  to  identify  the  orientation  of  each 
segment  (or  line)  composing  the  edges.  This 
information  can  help  to  reconstruct  the  3-D 
scene  and  to  understand  how  objects  are 
related  in  terms  of  3-D  features  (for  instance, 
an  object  is  in  front  of  another  one).  Typical 
examples  of  line  labels  are:  border,  convex 
and  concave  edge. 

Other problems of this kind, which are named labelling problems,
can be found in the area of speech recognition (identification of
words or phonemes in an acoustic signal), fault diagnosis as well
as in game theory.

Generally speaking, in a labelling problem we are given a complex
object $O$ which can be decomposed into a set
$U = \{u_1, \ldots, u_n\}$ of units, each of which can be named with
a label $l$ taken from a set $L = \{l_1, \ldots, l_m\}$ of labels.
The description of $O$ is given by the description of each single
unit in terms of a set of attributes $A = \{a_1, \ldots, a_p\}$ as
well as by the description of their relationships in terms of a set
of relations $R = \{r_1, \ldots, r_q\}$. The aim is that of
associating the right label to some (or all) units of $O$. As said
above, structured objects can be easily represented in first-order
logic. For instance, by assuming that all relations in $R$ are
binary, we can represent $O$ as a conjunction of (typically
positive) literals involving binary predicates:

$a_1(u_1, c_{11}) \wedge a_1(u_2, c_{12}) \wedge \ldots \wedge a_1(u_n, c_{1n}) \wedge \ldots \wedge a_p(u_n, c_{pn}) \wedge r_1(u_1, u_2) \wedge \ldots \wedge r_q(u_{n-1}, u_n)$

where $c_{ij}$ are constants representing values of the attributes
while $u_i$ are constants used to denote each unit. Obviously, when
labels are known, the above conjunction will be conjoined with as
many literals $l(u_i, l_j)$ as the number of labelled units (here
$l$ is the label predicate and the constants $l_j$ are elements of
$L$). Given a set of instances of a structured object $O$ for which
some (or all) units have been labelled, the learning problem is
that of learning rules that allow for labelling their units. For
instance, given a set of pictures of an office in which we have
labelled the segments with the name of some objects, such as chair,
desk and so on, we want to discover some rules for labelling
segments so that we will be able to recognize objects in a new
picture.
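The representation of a structured object as a conjunction of binary literals can be sketched as a plain list of triples. The function name, the attribute and relation names, and the office example below are all hypothetical; the sketch only shows how attribute, relation and label literals sit side by side in one description.

```python
def describe_object(attributes, relations, labels=None):
    """Build the ground literals describing one structured object.

    attributes: {attribute_name: {unit: value}}
    relations:  {relation_name: [(unit, unit), ...]}
    labels:     {unit: label} for the units whose label is known
    Every literal is a triple (predicate, first_argument, second_argument).
    """
    literals = []
    for attribute, values in attributes.items():
        for unit, value in values.items():
            literals.append((attribute, unit, value))
    for relation, pairs in relations.items():
        for first, second in pairs:
            literals.append((relation, first, second))
    for unit, label in (labels or {}).items():
        literals.append(("l", unit, label))  # label predicate l(unit, label)
    return literals


# Hypothetical picture with two segments, one of them labelled
office = describe_object(
    attributes={"shape": {"u1": "rectangle", "u2": "square"}},
    relations={"adjacent": [("u1", "u2")]},
    labels={"u1": "desk"},
)
```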

Typically, in labelling problems it is not convenient to learn
concepts independently of each other since concepts are themselves
strongly related or mutually constrained. The independence
assumption on which most learning algorithms rely is not generally
adequate, and it may lead to learning as many rules as the number
of instances of a given concept.

Luckily, in some labelling problems, it may happen that concept
dependencies are intrinsically acyclic and it is possible to define
a dependence hierarchy in a quite natural way. In this case it is
possible to start the learning process with those concepts in the
hierarchy that appear at the very bottom (minimally dependent
concepts). The language of observations will contain all
operational (or extensionally defined) predicates, a and r (ground
literals), while the language of hypotheses can be allowed to
contain non-operational (or intensionally defined) predicates as
well. It is worthwhile to observe that minimally dependent concepts
are learned independently of each other, thus traditional learning
systems that induce structural descriptions can be effectively
employed in this step. Generalizations of the learned concepts, say
$C_1, C_2, \ldots, C_k$, will be a set of Horn clauses of the type:

l(X, l_i) <- φ(X, Z1, Z2, ..., Zm)

where φ(X, Z1, Z2, ..., Zm) denotes a conjunction of literals
concerning operational and non-operational predicates in which X,
Z1, Z2, ..., Zm are the only variables that occur.

When the minimally dependent concepts have been learned, it is
possible to learn those concepts that depend directly on
$C_1, C_2, \ldots, C_k$, but now the language of observations will
contain the predicate l, whose extensional definition is given by
all possible instantiations l(u_j, l_i), $i \in \{1, 2, \ldots, k\}$,
used as positive examples in the previous step. If the dependence
hierarchy is well defined, the rules produced for concepts learned
in this step will contain at least one occurrence of the predicate
l. More precisely, if the concept $C_{k+1}$ directly depends on the
concepts $C_i$ and $C_j$, we expect that the rule for $C_{k+1}$
will contain at least one literal of the set
{l(Y, l_i), ¬l(Y, l_i)} as well as one literal of the set
{l(Y, l_j), ¬l(Y, l_j)}. Such an expectation is explained by the
fact that concepts $C_i$ and $C_j$ are useful to explain or predict
$C_{k+1}$. When the previous conditions do not hold, it means that
either the dependence hierarchy is partly over-specific (some
concept dependencies do not occur in reality), or evidence in the
observations is not enough to detect such dependencies (this is
particularly true when dependencies are probabilistic rather than
deterministic), or the search strategy of the learning system is
simply limited in some way. The multistep learning process
continues until all concepts in the dependence hierarchy have been
learned.


5.3  Application to document understanding


In previous experiments concepts have been assumed to be
independent. In this section we show some experimental results of
the attempt to learn contextual rules. The dependence hierarchy
between concepts is reported in Figure 7. The reason for this
definition can be partly explained in terms of spatial reasoning.
In fact, when the logotype has been recognized, the recognition of
the contiguous logical components should be easier. In this case
the contiguous logical components are sender (which is above the
logotype) and ref, which is to the right of the logotype. When the
ref has been recognized, the date, which is to its right, should be
more easily recognized. Finally, when sender, ref and date are
recognized, the identification of the receiver should be easier.
The logotype has been chosen as the minimally dependent concept
because it is generally the only block of type picture and,
moreover, it is never (or rarely) fragmented or grouped with other
blocks, thus it should be the simplest concept to learn. After
having defined a dependence hierarchy, FOCL was used to generate
the rule for the concept logotype by using only the original set of
predicates (extensionally and intensionally defined). Then, we ran
FOCL to generate the rules for the concepts sender and ref by using
the original set of predicates together with the predicate

learned_description_of_logic_type-logo

previously learned. Then, FOCL was asked to generate the rule for
the concept date by using the original predicates plus
learned_description_of_logic_type-logo,
learned_description_of_logic_type-sender and
learned_description_of_logic_type-ref. Finally, FOCL learned the
concept receiver from both the original predicates and all the
learned predicates.

Figure 7. The hierarchy of concept dependencies
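The staged procedure just described can be summarized by listing, for each concept, the learned predicates that are available when its rule is generated. This is only a bookkeeping sketch: the grouping into stages follows the order given in the text, the function name is hypothetical, and the predicate name used for date is formed by analogy with the names quoted above.

```python
PREFIX = "learned_description_of_logic_type-"

# Concepts in the order in which their rules were generated
stages = [["logo"], ["sender", "ref"], ["date"], ["receiver"]]


def learned_predicates_per_concept(stages):
    """Map each concept to the learned predicates usable while learning it.

    The original (extensional and intensional) predicates are always
    available and are therefore not listed.
    """
    learned = []
    available = {}
    for stage in stages:
        for concept in stage:
            available[concept] = list(learned)
        # Predicates learned in this stage become usable from the next one on.
        learned.extend(PREFIX + concept for concept in stage)
    return available


available = learned_predicates_per_concept(stages)
```

Here available["logo"] is empty, while available["receiver"] lists the descriptions learned for logo, sender, ref and date.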

Experimental results are shown in Table VII. First of all, it
should be noticed that the average error rate is 6.4%, which is the
lowest rate we got in all the experiments (about 2.0% less than the
basic case). Such an increase in predictive accuracy is a
counter-intuitive result, since when the prediction of a rule
depends on the prediction made by another rule, an error in the
first rule to fire may be propagated to the second (dependent)
rule. Thus, the average error rate for contextual rules should be
higher than the error rate for independent rules. However, this
problem does not occur when concepts are really dependent and the
dependence hierarchy is well defined, since in that case error
propagation should not occur frequently. Actually, this is what we
observed in our experiments.

Secondly, we did not get problems of non-convergence or infinite
recursion, since contextual rules are easier to learn. Indeed, the
total number of tested literals is 124,929, which is less than the
number of tested literals in the other experiments in which 4
cliches were used to generate rules. Such a decrease in learning
time is counter-intuitive as well, since for some concepts the
search space is wider, not smaller, due to the introduction of new
predicates. The rationale for these results is that, when concepts
are really dependent, information on the context should help to
learn more quickly.

Table VII

Results of the sixth experimentation:
contextual learning

Thirdly, sometimes, even if concepts are really dependent, FOCL is not
able to capture such dependencies due to its own search strategy.
Actually, this is a problem of any traditional learning system rather
than a fault of FOCL. For instance, this problem can be observed in
FOIL as well. Indeed, in order to express a dependence between two
concepts, relations between variables representing different logical
components have to be introduced in a clause (in the case of document
understanding, those relations are geometrical relations such as
above, on-top, etc.). Unfortunately, relations have quite often a
small or even negative information gain if taken alone, so a greedy
strategy like hill-climbing will never consider them, at least in the
early steps of the generation of a hypothesis. As a consequence,
dependence between concepts will not be considered in the first steps
of the learning process and the final result will be strongly
influenced by the initial choices. As already pointed out in section
3, such a problem can be partially solved by means of either
determinate literals (Quinlan, 1991), as in FOIL, or relational
clichés, as in FOCL. This is the main reason for which we still
preferred to use clichés in this last experimentation even if previous
experimental results were unsatisfactory when clichés were used.
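
The effect of a greedy gain-based search on relational literals can be
illustrated numerically. The following is a minimal sketch, assuming
the usual FOIL-style information gain; the function name and all tuple
counts are invented for illustration and are not taken from the
experiments reported in this paper.

```python
import math

def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL-style information gain of adding one literal to a clause.

    p0, n0: positive/negative tuples covered before adding the literal.
    p1, n1: positive/negative tuples covered after adding it.
    t: positive tuples of the old clause still covered (defaults to p1).
    """
    if p1 == 0:
        return 0.0  # the literal excludes every positive tuple
    if t is None:
        t = p1
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return t * (info_before - info_after)

# Hypothetical counts: a relational literal such as above(X, Y) introduces
# a new variable Y and doubles the tuples, but taken alone it leaves the
# proportion of positive tuples unchanged.
relational_alone = foil_gain(p0=10, n0=10, p1=20, n1=20, t=10)

# Hypothetical counts: an attribute literal that removes half the negatives.
attribute_alone = foil_gain(p0=10, n0=10, p1=10, n1=5)

print(relational_alone, attribute_alone)
```

With these counts the relational literal scores no gain while the
attribute literal scores a positive one, so a hill-climbing search
would pick the attribute literal first, which is the behaviour
described above.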

6.  Conclusions 

In this paper, the problem of document understanding has been
approached by means of inductive learning, in particular learning from
examples. This approach represents a novelty in the field of document
understanding and presents the advantages of generality and easy
customizing of the document management system.

It is worthwhile to notice that in our approach the content of text
blocks is not exploited to recognize each single logical component. On
the contrary, the recognition of logical components based on the
layout structure of a document allows an OCR system to read only some
specific components of interest for a particular application.

Given a set of documents, training examples are descriptions of some
logical components of the documents. Therefore, the problem is that of
learning recognition rules for subparts of the logical structure of a
document. This problem is a particular kind of labelling problem in
which subparts of a complex object have to be labelled. The second
novelty of this paper is that of criticizing the independence
assumption made by almost all learning systems and proposing a new
learning strategy, named contextual learning, in which a dependence
hierarchy of concepts is exploited to define both the order in which
concepts should be learned and the proper observation language.
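
The role of the dependence hierarchy can be sketched in a few lines.
This is only an illustration under stated assumptions: the concept
names, the base predicates and both helper functions are hypothetical,
and the hierarchy is encoded as a mapping from each concept to the
concepts it depends on (Python 3.9 or later is assumed for graphlib).

```python
from graphlib import TopologicalSorter

# Hypothetical dependence hierarchy: each concept maps to the concepts it
# depends on. The concept names are illustrative, not taken from the paper.
dependence = {
    "logo": set(),
    "sender": {"logo"},
    "date": {"sender"},
    "body": {"date"},
    "signature": {"body"},
}

def learning_order(hierarchy):
    """Return concepts so that each one follows the concepts it depends on."""
    return list(TopologicalSorter(hierarchy).static_order())

def observation_language(concept, hierarchy, base_predicates):
    """Predicates available when learning `concept`: the base descriptors
    plus one predicate for every concept it directly depends on."""
    return list(base_predicates) + sorted(hierarchy[concept])

order = learning_order(dependence)
language = observation_language("date", dependence, ["height", "width", "above"])
print(order)
print(language)
```

The ordering guarantees that, when a concept is learned, the concepts
it depends on already have rules and can appear as predicates in its
observation language.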

The paper empirically compares the traditional learning strategy in
which the independence assumption is made with such a contextual
learning strategy. For this reason, FOCL, a system that learns Horn
clauses, was employed to generate the recognition rules for five
logical components of a set of single page documents, namely copies of
letters sent by an Italian company. Results concerning the first
strategy show that the introduction of background knowledge,
relational clichés and recursion does not significantly improve the
predictive accuracy of the whole set of rules, even though some single
concepts are better recognized by introducing one of such variants of
the basic case. The definition of a hierarchy of concept dependencies,
based on the spatial contiguity of logical components, allows FOCL to
easily learn contextual rules whose average error is the lowest.

These results have spurred us to investigate the problem of learning
dependencies between concepts, namely building a dependence hierarchy
between concepts, which is presently provided by the teacher. For this
purpose, we adopted the same multistrategy learning approach followed
for the problem of document classification to the problem of document
understanding. In fact, for each block it is possible to define a set
of numerical features describing geometrical characteristics of a
block as well as percentage of black pixels, number of transitions
black-white, and so on. Such features can be appropriately managed by
a parametric classifier, namely Fisher's linear discriminant
functions, whose results in the training phase can be exploited in
several ways. One way consists in mapping the classification results
of the discriminant functions into a new predicate which is in turn
used by the system that learns Horn clauses. The other way is that of
exploiting information on the discriminatory power of the parametric
classifier in order to define the order in which concepts should be
learned, starting with those concepts that can be easily recognized.
Some preliminary results are presented in (Esposito et al., 1993b).

Abstract

This paper describes work in progress towards a computer vision
system that learns to recognize three-dimensional nonrigid objects,
and to locate and orient them in space. Similarly to its
two-dimensional predecessor GEST, the system uses graphs to describe
shape structure, and relies on a multistrategy approach to learn
models of classes of 3-D shapes. A key new component of this system is
a constructive induction method that discovers geometric relations
among parts in a three-dimensional shape.

Key words: Computer vision, graph, clustering, 3-D shape recognition.

1  Introduction 

Structural models composed of parts and relations have been used for
more than 20 years as structural models of shape (Barrow et al., 1972;
Shapiro, 1980; Nachman, 1984). Such models have proved especially
suitable in computer vision, where recognition and interpretation are
the main goals of modeling, and it is likely that they will find use
in other shape related fields, such as image compression or computer
graphics.

Construction of structural models requires learning. A computer vision
system that applies multistrategy learning to form structural models
has been described in (Segen, 1993). This system, named GEST,
describes structure using graphs. It learns models of nonrigid 2-D
shapes from numerical examples, without user's help. Using these
models it recognizes 2-D objects in video input and computes their
pose in real time. GEST works well enough that it is being used as an
input device that allows one to control graphics applications with
hand gestures.

The success of GEST in two dimensions gives incentive to research
towards a multistrategy learning system that operates in a
three-dimensional world: a GEST-3D. Its current status is described in
this paper. A key new component of GEST-3D is a representation for 3-D
geometric relations, that permits multiple types of parts and
relations, and a constructive induction method that learns 3-D
relations from data.

2  GEST-3D 

GEST-3D will learn to recognize three-dimensional nonrigid objects
from incomplete information, and to estimate object's pose, that is
the location and orientation in 3-D space. This system will use a
multistrategy learning approach, analogous to that in GEST, to
construct models of 3-D shape classes and to infer pose estimators.
The learning module of GEST-3D, shown schematically in Figure 1,
consists of three parts: Constructive Induction, Graph Learning, and
Pose Learning. The first part uses a new constructive induction method
to discover symbolic relations in a set of numerically represented
instances of three-dimensional shapes. These relations are used as the
basic primitives in the graph learning part, which builds a structural
model for each class of 3-D shapes. The third part learns parametric
functions for computing pose, using statistical methods of robust
estimation. The results of all three learning parts are collected in a
model library, which is used by an object recognizer to interpret
three-dimensional shapes.

Since graph representation is independent of space dimension, the
graph learning methods (Segen, 1990) used in GEST are taken with
almost no changes. The two remaining parts are based on new research.
The constructive induction approach used for learning 3-D relations is
presented in the following sections of this paper. Methods used for
learning three dimensional pose will be described separately.

3  Constructive  Induction  of 
Relations 

A geometric relation represents a range of values of displacement and
rotation between parts, or a relative pose. In a rigid object these
values are nearly constant, but in a nonrigid object they can vary
between different object instances, and range of variations can be
different for different pairs of parts. The goal of the constructive
induction process is to discover geometric relations in numerical
data, and represent them as symbols.


Figure 1: Learning in GEST-3D


This constructive induction method examines a set of training shapes
for natural groupings of part types and their relative poses in pairs
of nearby parts. Each identified group represents a geometric
relation, involving two parts and a distribution of parameters of
their relative pose. A pair of primitive parts joined by one of the
discovered relations is then treated as a higher order part, and the
grouping process repeats, giving rise to a hierarchy of parts and
relations, similar to the hierarchy proposed in (Marr and Nishihara,
1978).

4 Primitive Parts

An instance shape is initially represented as a collection of
primitive parts. These parts do not have to be mutually exclusive,
that is some parts may overlap. The collection of parts does not have
to cover the entire shape. The literature on shape provides an
abundance of algorithms and techniques for extracting parts from a
two- or three-dimensional shape. Some methods partition the primary
representation, e.g. a curve, surface, or volume into homogeneous
regions, i.e. regions whose local properties are approximately
constant. Other methods identify singularities, that is points or
boundaries of the primary representation which are unique within their
neighborhood, such as edges, corners, local extrema of curvature, or
critical points of a surface. Most of the published methods can be
adapted to generate parts that satisfy the requirements of the
approach described in this paper. This approach also allows one to mix
together parts generated by different methods.

Definition: A primitive part p is a triple [type, inv, var], where
type is a symbol from a finite alphabet, inv and var are real valued
vectors, such that if T is a rigid transformation, then
$T(p) = [\mathit{type}, \mathit{inv}, T(\mathit{var})]$.
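
The definition can be made concrete with a small data structure. The
sketch below is an illustration only: the class name, the helper
function and the choice of a 2-D point part whose var vector holds its
(x, y) position are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PrimitivePart:
    type: str                  # symbol from a finite alphabet
    inv: Tuple[float, ...]     # parameters invariant under rigid transformations
    var: Tuple[float, ...]     # parameters that change with rigid transformations

def transform_point_2d(part, angle, tx, ty):
    """Apply a 2-D rigid transformation T (rotation by `angle`, then
    translation by (tx, ty)) to a point-type part whose var is (x, y)."""
    x, y = part.var
    c, s = math.cos(angle), math.sin(angle)
    new_var = (c * x - s * y + tx, s * x + c * y + ty)
    # T(p) = [type, inv, T(var)]: type and inv are carried over unchanged.
    return PrimitivePart(part.type, part.inv, new_var)

p = PrimitivePart(type="P", inv=(), var=(1.0, 0.0))
q = transform_point_2d(p, math.pi / 2, 2.0, 3.0)
print(q)
```

Only the var vector is touched by the transformation; a point has no
invariant parameters, so its inv vector is the empty tuple.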

The type symbol specifies the format and interpretation for inv and
var vectors, and it is used to distinguish parts generated by
different extraction methods. The inv vector contains parameters that
are invariant under rigid transformations of the coordinate system.
These parameters are specific to the type of a part, which also
remains constant under rigid transformations. Some parts, such as a
single point, have no invariant parameters, that is the inv vector has
dimension zero. The var vector consists of parameters that change with
rigid transformations. This vector carries information related to the
part's pose, that is part's position and orientation in space. In 2-D
space pose consists of three values: two coordinates of position and
the orientation angle. In 3-D space pose has six dimensions: three
position coordinates and three angles. If the pose of a part can be
determined based on the form of the part, then the var vector is the
pose. However, for parts with symmetries pose cannot be determined
completely, but it can be restricted to a number of degrees of freedom
(for continuous symmetries) or to a number of values (for discrete
symmetries). The approach described here considers only parts with
continuous symmetries such as line, or sphere. For such parts the var
vector contains the maximal number of parameters that constrain the
pose.

Definition: A representation by parts of shape S is a set of primitive
parts $P(S) = \{p_1, p_2, \ldots\}$.

The function P in this definition stands for a method used to segment
or to extract parts from S. The representation P(S) should satisfy the
following properties:

1. Invariance: Without presence of noise, a rigid transformation
should not change the representation. This means that for any rigid
transformation T, $P(S) = \{p_1, p_2, \ldots\}$ implies
$P(T(S)) = \{T(p_1), T(p_2), \ldots\}$.

2. Locality: A part description should not be affected by shape
changes outside of the part's immediate neighborhood.

3. Stability: The representation should not be significantly affected
by random noise in the image. This means that if $S_1$ and $S_2$ are
two noisy images of the same shape, then there exists a one-to-one
mapping between large subsets of $P(S_1)$ and $P(S_2)$, where the
corresponding parts have identical type symbols, and similar inv and
var parameters.

5  Constructing  New  Parts 

All pairwise relations might contain useful information, but examining
all such relations in a large set of parts can be too costly.
Therefore, direct pairwise relations are restricted only to pairs of
nearby parts. Pairs of nearby parts are also used to construct new
parts, called composite parts, that are treated just like the
primitive parts. One can find binary relations between composite
parts, and combine a pair of composite parts into a new composite
part.

One can think of a composite part as a root of a binary tree. All
non-leaf nodes of this tree are composite parts, and the leaves are
primitive parts. This binary tree defines the composite part at its
root, and determines the order of operations needed to construct it.
Depth of this tree determines the level of the part at the root. The
level of a primitive part is 0, a level-1 part is constructed from a
pair of primitive parts, two level-1 parts produce a level-2 part, and
so on. The current method combines only parts with equal levels, so a
composite part at level n has $2^n$ primitive parts at its leaves,
i.e. it is a balanced tree. This restriction is used only for
computational convenience, and it may be relaxed in the future.
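
The tree view of a composite part can be sketched as follows. The
class and function names are hypothetical and the labels are
placeholders; the sketch only illustrates the equal-level restriction
and the resulting leaf count.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartNode:
    label: str
    left: Optional["PartNode"] = None
    right: Optional["PartNode"] = None

    @property
    def level(self):
        # A primitive part (leaf) has level 0.
        if self.left is None and self.right is None:
            return 0
        # Both children have the same level, so either one can be used.
        return 1 + self.left.level

    def leaves(self):
        if self.left is None and self.right is None:
            return [self.label]
        return self.left.leaves() + self.right.leaves()

def combine(a, b):
    """Combine two parts of equal level into a part one level higher."""
    if a.level != b.level:
        raise ValueError("the current method combines only parts of equal level")
    return PartNode(label=f"({a.label},{b.label})", left=a, right=b)

prims = [PartNode(f"p{i}") for i in range(4)]
level1 = [combine(prims[0], prims[1]), combine(prims[2], prims[3])]
level2 = combine(level1[0], level1[1])
print(level2.level, level2.leaves())
```

A level-2 part built this way has four primitive parts at its leaves,
in agreement with the count of $2^n$ leaves at level n.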

A composite part is represented the same way as a primitive part, as
[type, inv, var]. Composite parts are constructed bottom-up, one level
at a time up to a preset maximal level.

The construction process repeats the following sequence of steps for
each level:

1. Cluster the inv vectors of all parts at level k, separately for
each part type, and assign a unique label to each cluster.

2. Assign to each part the label of its nearest cluster, or NIL if the
nearest cluster is too far.

3. Terminate if k is equal to a preset maximal level.

4. Find pairs of neighboring parts with non NIL labels, and construct
k + 1 level parts by applying a composition operation.
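
Step 2 of the sequence above can be sketched as a nearest-cluster
lookup. The paper does not specify the clustering method or the
distance test, so the sketch assumes that each cluster is summarized
by a center, that distance is Euclidean, and that "too far" means
beyond a fixed threshold; all names are hypothetical.

```python
import math

NIL = None

def assign_label(inv, clusters, max_dist):
    """Return the label of the cluster center nearest to `inv`, or NIL if
    even the nearest center is farther than `max_dist`."""
    best_label, best_dist = NIL, math.inf
    for label, center in clusters.items():
        d = math.dist(inv, center)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_dist else NIL

# Hypothetical clusters of one-dimensional inv vectors for one part type.
clusters = {"c1": (1.0,), "c2": (5.0,)}
near = assign_label((1.2,), clusters, 0.5)
far = assign_label((3.0,), clusters, 0.5)
print(near, far)
```

Parts that receive NIL are left out of step 4, so they do not take
part in the construction of the next level.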

The result of the above construction process is a set of clusters of
inv vectors, grouped into part types. These clusters are saved in a
library, which is later used by a recognition program to assign
symbolic labels to primitive parts, and to relations among parts. The
key element of the construction process is the composition operation
(step 4) which forms new composite parts from pairs of existing parts.
This operation is described in detail in the following section.

6  Composition  of  Parts 

The composition operation is applied to an ordered pair of parts. The
result of this operation is a composite part represented as
[type, inv, var].

The type of the result of composition is a string obtained by
concatenating the types of components, treated as strings. The inv and
var vectors of the result are computed from the var vectors of the
components. This operation depends on symmetry types of the component
parts. A symmetry type specifies a group of rigid transformations
which do not change the part's appearance. For example, an infinitely
long cylinder has two continuous symmetries: rotation around, and
translation along the axis. The approach proposed here is restricted
to parts with continuous symmetries. Discrete symmetries, such as the
symmetries of a cube, may be treated in future extensions.

The symmetry type of a part is a function of the part type. The
symmetry type of a primitive part has to be defined for each primitive
part type. Symmetry types of composite parts are determined by
composition rules in Tables 2 and 4.

6.1 Parts in 3-D Space

Table 1 lists six symmetry types used for three-dimensional parts. The
first two columns show a geometric form with a given type of symmetry,
and part's symbol. Third column shows dimension of var. The symmetry
transformations are shown in fourth column using a symbolic notation,
where nR means rotation about n axes, and nT means an n-dimensional
translation. Table 2 shows the symmetry type and the number of
invariant parameters of the result of composition, for all
combinations of component symmetry types. In most cases it is possible
to define the composition operation in several equivalent ways, using
different expressions for computing the inv and var terms of the
composition result from the component parts. The equivalence of such
alternative formulations means that the composition result for one of
the formulations contains enough information to compute the
composition result of any other formulation, without using data from
the component parts. Examples of definition of the composition
operation are given below for four cases from Table 2. The following
notation is used in these examples: P(x) is the point specified by x,
for symmetry types P and PL. L(x) is the line and r(x) the unit vector
for symmetry types L, PL, and LS.

Table 1: Part types in 3-D

Part            Symb.   |var|   Symm.
Point           P       3       3R
Line            L       4       1R+1T
Plane           S       3       1R+2T
Point on line   PL      5       1R
Line on plane   LS      5       1T
Frame           F       6       None

Case 1 Operands: x, y, both of type P.
Result s of type PL:

var: P(s) is the midpoint between P(x) and P(y), that is
$\frac{1}{2}(P(x) + P(y))$. L(s) is the line defined by P(x) and P(y).
inv: Distance between P(x) and P(y).
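
Case 1 can be sketched directly from its description. The function
names are hypothetical, and the line L(s) is represented here by the
midpoint together with a unit direction vector, which is an assumption
of this sketch rather than a representation stated in the paper.

```python
import math

def compose_points(px, py):
    """Case 1 sketch: two parts of symmetry type P (3-D points) -> type PL.

    Returns (var, inv) where var = (midpoint, unit direction of the line
    through the two points) and inv = (distance,).
    """
    diff = [a - b for a, b in zip(px, py)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        raise ValueError("coincident points do not define a line")
    midpoint = tuple((a + b) / 2.0 for a, b in zip(px, py))
    direction = tuple(d / dist for d in diff)
    return (midpoint, direction), (dist,)

def rotate_z(p, angle):
    """Rotate a 3-D point about the z axis (one example of a rigid motion)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

x, y = (1.0, 0.0, 0.0), (-1.0, 0.0, 2.0)
var, inv = compose_points(x, y)
var_rot, inv_rot = compose_points(rotate_z(x, 0.7), rotate_z(y, 0.7))
print(var, inv, inv_rot)
```

Rotating both operands changes the var term of the result but leaves
the inv term, the distance between the two points, unchanged up to
rounding.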

Case 2 Operands: x of type P, y of type PL.
Result s of type F:

var: The frame origin is the midpoint between P(x) and P(y). The
orientation of the first axis is given by the unit vector

$u = \frac{P(x) - P(y)}{|P(x) - P(y)|}$

The remaining axes are obtained by a Gram-Schmidt orthonormalization:
Let $v' = r(y) - (r(y) \cdot u)\,u$, then a unit vector $v = v'/|v'|$
defines the second axis; the third axis is $w = u \times v$.
inv: Distance between P(x) and L(y), and signed distance from P(y) to
the projection of P(x) on L(y).
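
The frame construction of Case 2 can be sketched with plain tuples.
The helper names and the sample coordinates are hypothetical; the
sketch follows the Gram-Schmidt step as described and assumes that
P(x) does not lie on the line L(y), since otherwise the second axis is
undefined and the normalization fails.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def scale(a, k):
    return tuple(k * p for p in a)

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    return scale(a, 1.0 / norm(a))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def compose_p_pl(px, py, ry):
    """Case 2 sketch: x of type P (point px), y of type PL (point py on a
    line with unit direction ry). Returns the frame (origin, u, v, w) and inv."""
    origin = scale(tuple(a + b for a, b in zip(px, py)), 0.5)
    u = unit(sub(px, py))                      # first axis
    v = unit(sub(ry, scale(u, dot(ry, u))))    # Gram-Schmidt step, second axis
    w = cross(u, v)                            # third axis
    # Invariants: distance from P(x) to L(y), and signed distance from P(y)
    # to the projection of P(x) on L(y).
    along = dot(sub(px, py), ry)
    perp = norm(sub(sub(px, py), scale(ry, along)))
    return (origin, u, v, w), (perp, along)

frame, inv = compose_p_pl((1.0, 2.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
origin, u, v, w = frame
print(origin, u, v, w, inv)
```

For the sample values the point lies at distance 2 from the line and
its projection falls at signed distance 1 from P(y), which are the two
invariant parameters listed for this case.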


Table 2: Part composition in 3-D

Comp.    Res.   |inv|
P,P      PL     1
P,L      F      1
P,S      PL     1
P,PL     F      2
P,LS     F      2
P,F      F      3
L,L      F      2
L,S      F      1
L,PL     F      3
L,LS     F      3
L,F      F      4
S,S      LS     1
S,PL     F      2
S,LS     F      2
S,F      F      3
PL,PL    F      4
PL,LS    F      4
PL,F     F      5
LS,LS    F      4
LS,F     F      5
F,F      F      6

Case 3 Operands: x, y, both of type PL.
Result s of type F:

var: Frame axes are computed as in Case 2, using r(x) + r(y) instead
of r(y) to find the second axis.
inv: Distance between P(x) and P(y), angle between r(x) and u, angle
between r(y) and u (u defined as in Case 2), and angle between the two
planes defined by a line through P(x) and P(y) and vectors r(x), and
r(y), or angle between normals to these planes.

Case 4 Operands: x, y, both of type F.
Result s of type F:

var: Frame axes are computed as in Case 2, using the sum of the unit
vectors from all six axes of x and y instead of r(y) to find the
second axis. If this sum is 0 then any subset of the axes can be used.
inv: Six parameters of rigid transformation from x to y.

6.2 Parts in 2-D Space

Symmetry types in 2-D form a subset of symmetry types in 3-D. This
subset consists of three symmetry types listed in Table 3. The
composition operation in 2-D is defined in Table 4. Each case of 2-D
composition can be derived from a corresponding 3-D case, as shown in
the example below.

Table 3: Part types in 2-D

Part            Symb.   |var|   Symm.
Point           P       2       1R
Line            L       2       1T
Point on line   PL      3       None

Table 4: Part composition in 2-D

Comp.    Res.   |inv|
P,P      PL     1
P,L      PL     1
P,PL     PL     2
L,L      PL     1
L,PL     PL     2
PL,PL    PL     3

Case 5 Operands: x, y, both of type PL in 2-D.
Result s of type PL:

var: P(s) and L(s) correspond to the origin and the first axis in
Case 3 above.
inv: Distance between P(x) and P(y), and the two angles between
vectors r(x) and r(y), and the line through P(x) and P(y).

7 Ordering Pairs of Parts

The procedure for matching the structural descriptions (Segen, 1990)
requires the arguments of binary relations to be ordered. To order a
pair of parts (x, y) the following three-step procedure is used. If x
and y have different types then they are ordered according to a
lexicographic order of their types. If the types are identical, but
the parts have different labels, then they are ordered by their
labels. If the labels are the same then an ordering function f(x, y)
is used. An ordering function must have the following properties:

1. There is a partial order relation > defined on the range of f.

2. Generally, $f(x, y) \neq f(y, x)$.

3. For any rigid transformation T, $f(x, y) > f(y, x)$ implies
$f(Tx, Ty) > f(Ty, Tx)$. Of course this is satisfied if
$f(x, y) = f(Tx, Ty)$. This property ensures that the order specified
by f is invariant under rigid transformations.

With the aid of f one orders parts x and y as (x, y) if
$f(x, y) > f(y, x)$, and vice versa. If neither $f(x, y) > f(y, x)$
nor $f(y, x) > f(x, y)$, then parts cannot be ordered.

An example of such a function for 2-D parts x and y and symmetry types
L, or PL, is the signed angle between part orientations r(x) and r(y).

An example for 3-D parts with symmetry type PL is the first angle
invariant in Case 3 in Section 6.1.
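
The three-step ordering procedure can be sketched for 2-D parts as
follows. Representing a part as a dictionary with 'type', 'label' and
'r' keys, and comparing labels as strings, are assumptions of this
sketch; the signed angle serves as the ordering function f.

```python
import math

def signed_angle(rx, ry):
    """Signed angle from 2-D unit vector rx to ry, in (-pi, pi]."""
    return math.atan2(rx[0] * ry[1] - rx[1] * ry[0],
                      rx[0] * ry[0] + rx[1] * ry[1])

def order_pair(x, y, f):
    """Order parts x and y, each a dict with 'type', 'label' and 'r' keys.

    Returns the ordered pair, or None when the parts cannot be ordered."""
    if x["type"] != y["type"]:
        return (x, y) if x["type"] < y["type"] else (y, x)
    if x["label"] != y["label"]:
        return (x, y) if x["label"] < y["label"] else (y, x)
    fxy, fyx = f(x, y), f(y, x)
    if fxy > fyx:
        return (x, y)
    if fyx > fxy:
        return (y, x)
    return None

def angle_f(x, y):
    return signed_angle(x["r"], y["r"])

a = {"type": "PL", "label": "c1", "r": (1.0, 0.0)}
b = {"type": "PL", "label": "c1", "r": (0.0, 1.0)}
c = {"type": "L", "label": "c7", "r": (1.0, 0.0)}
same_label = order_pair(a, b, angle_f)
different_type = order_pair(a, c, angle_f)
unordered = order_pair(a, a, angle_f)
print(same_label == (a, b), different_type == (c, a), unordered)
```

Because the signed angle between two orientations does not change when
both parts undergo the same rotation and translation, this choice of f
satisfies the third property.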

Using an ordering function presents a minor problem. A natural cluster
that intersects the hypersurface $f(x, y) = f(y, x)$ will be split
into two clusters. This is an undesirable feature, since such a split
is purely artificial. Split clusters can be merged using the following
consolidation procedure.

For each cluster $C_i$ form an inverted cluster $-C_i$, then find a
cluster $C'$ in a set $C_{i+1}, C_{i+2}, \ldots$, which is nearest to
$-C_i$. An inverted cluster $-C$ is a cluster formed by reversing the
order of composition of elements of $C$. In most cases such an
operation is a function of cluster parameters, so it does not require
reprocessing the cluster elements. If $-C_i$ and $C'$ are sufficiently
close then merge $-C_i$ into $C'$, provide a pointer from $C_i$ to
$C'$, and delete all clusters that point to $C_i$. In addition, delete
any cluster that is close to its own inverse.

If the above procedure is used then the labeling step is modified as
follows: If a composite part is assigned to a cluster a that points to
a cluster b, then the part is inverted and receives the label of the
cluster b.

8  Extracting  Relations 

The label of a primitive part symbolically describes an invariant
property (unary relation), such as size or curvature. The label of a
composite part P describes a binary relation between its two component
parts or children. It also describes a 4th-order relation among the
part's grandchildren (if any), 8th-order relation among the great
grandchildren, and so on, until it finally describes a $2^n$-ary
relation over a set of primitive parts.

After constructing the composite parts up to a preset level, one
retains only their labels, and the parent-child links. The resulting
structure is a graph with labeled vertices, that are grouped into
layers according to their depth. The leaves of the graph represent the
primitive parts; other vertices represent composite parts. This graph
contains all the information about the shape, that is used for
recognition and interpretation.

9  Final  Remarks 

The relation constructor has been programmed only for the 2-D case
(Segen, 1988; 1989) and used as a module in the GEST system. This
implementation uses one type of a primitive part: a local extremum of
curvature of the boundary of a planar shape. This part has one
invariant parameter, the curvature. The var vector contains the
position of the extremal point, and the direction of the curve normal
at this point, that is its symmetry type is PL. The symmetry type of
all the composite parts derived from these primitives is also PL.

A first 3-D implementation will use corner-like parts of symmetry type
PL, that are extracted by a structural stereo vision system.

Abstract

Although people rely heavily on visual cues during problem solving, it
is non-trivial to integrate them into machine learning. This paper
reports on three general methods that smoothly and naturally
incorporate visual cues into a hierarchical decision algorithm for
game playing: two that interpret predrawn straight lines on the board,
and a third that uses an associative, hierarchical pattern database
for pattern recognition. They have been integrated into Hoyle, a game
learning program that makes decisions with a hierarchy of modules
representing individual rational and heuristic agents.

Key words: machine learning, game playing, hierarchical decision
algorithms, visual cues, pattern recognition

1. Introduction

Since the early work of Chase and Simon, researchers have noted that
expert chess players retain thousands of patterns (Holding, 1985).


Jack Gelfand
Frt^ Midgley

Department of Computer Science
ijg@phoenix.princeton.edu

There has been substantial additional work on having a program learn
specific patterns for chess (Berliner, 1992; Campbell, 1988; Flann,
1992; Levinson and Snyder, 1991). There is conflicting evidence as to
whether or not expert game players learn to play solely by associating
appropriate moves with key patterns detected on the board, but it is
believed that pattern recognition is an important part of a number of
different strategies exercised in expert play (Holding, 1985). In AI,
visual cues have previously demonstrated their power as explicit
search control directives and as hand-selected terms in an evaluation
function (Gelernter, 1963; Samuel, 1963). Learned visual cues have
also been derived from goal states with a predicate calculus
representation (Fawcett and Utgoff, 1991; Yee, et al., 1990).

This paper integrates the pattern recognition and the explanatory
heuristics that experts use into a program called Hoyle that learns to
play two-person, perfect information, finite board games against an
external expert. As in the schematic of Figure 1, whenever it is
Hoyle's turn to move, a hierarchy of resource-limited procedures
called Advisors is provided with the current game state, the legal
moves, and any useful knowledge (described below) already acquired
about the game. There are 22 heuristic Advisors in two tiers. The
first tier sequentially attempts to compute a decision based upon
correct knowledge, shallow search, and simple inference, such as
Victory’s “make a move that wins the contest immediately.” If no
single decision is forthcoming, then the second tier collectively
makes many less reliable recommendations based upon narrow viewpoints,
like Material’s “maximize the number of your markers and minimize the
number of your opponent’s.” Based on the Advisors’ responses, a simple
arithmetic vote selects a move that is forwarded to the game-playing
algorithm for execution.
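
The two-tier decision procedure can be sketched as below. This is not
Hoyle's code: the function names, the toy Advisors, the toy game state
and the use of numeric strengths summed per move are all assumptions
made to illustrate the control flow described above.

```python
from collections import defaultdict

def decide(state, legal_moves, tier1, tier2):
    """Two-tier decision sketch.

    tier1: advisors called in order; each returns a single move or None.
    tier2: advisors that each return {move: strength} recommendations.
    """
    for advisor in tier1:
        move = advisor(state, legal_moves)
        if move is not None:
            return move           # first definite decision wins
    votes = defaultdict(float)
    for advisor in tier2:
        for move, strength in advisor(state, legal_moves).items():
            votes[move] += strength
    # Simple arithmetic vote; ties go to the earliest legal move.
    return max(legal_moves, key=lambda m: votes[m])

# Hypothetical toy advisors on a toy state.
def victory(state, moves):
    return next((m for m in moves if m in state["winning"]), None)

def material(state, moves):
    return {m: state["captures"].get(m, 0) for m in moves}

def center(state, moves):
    return {m: 1.0 for m in moves if m == "b2"}

state = {"winning": set(), "captures": {"a1": 2, "b2": 1.5}}
voted = decide(state, ["a1", "b2", "c3"], [victory], [material, center])
state["winning"] = {"c3"}
decided = decide(state, ["a1", "b2", "c3"], [victory], [material, center])
print(voted, decided)
```

In the first call no first-tier Advisor commits, so the summed
second-tier recommendations select the move; in the second call the
first tier finds an immediately winning move and the vote is never
taken.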


Figure 1: How Hoyle makes decisions.

The program learns from its experience to make better decisions based
on acquired useful knowledge. Useful knowledge is expected to be
relevant to future play and may be correct in the full context of the
game tree. Examples of useful knowledge include recommended openings
and states from which a win is always achievable. Each item of useful
knowledge is associated with at least one learning algorithm. The
learning methods for useful knowledge vary, and include
explanation-based learning, induction, and deduction. The learning
algorithms are highly selective about what they retain, may
generalize, and may choose to discard previously acquired knowledge.
When individual Advisors apply current useful knowledge to construct
their recommendations, they integrate these learning strategies. Full
details on Hoyle are available in (Epstein, 1992).

Visual cues are integrated into Hoyle’s decision-making process as new
Advisors in the second tier. These Advisors react to lines and clusters
of markers without reasoning. This is prompted by our observation that
people guide their play with frequently-observed patterns of pieces
before they understand their significance. The distinction drawn here
between thinking and seeing in game playing is an important one. By
“thinking” we mean the manipulation of symbolic data, such as
“often-used opening gambit;” by “seeing” we mean inference-free,
explanation-free reaction to visual stimuli. The three Advisors
described here are directed toward the construction of a system that
both uses and learns visual cues. They provide powerful performance
gains and promise a natural integration with learning. This paper
indicates how Hoyle, already a multistrategy learning program, can
integrate knowledge about visual cues, and methods to learn them.

2.  Using  Predrawn  Lines 

Figure 2: Some five men’s morris states with white to move: (a) in the
placing or the sliding stage, (b) and (c) in the sliding stage.

Morris games have been played for centuries throughout the world on
boards similar to those in Figure 2. For clarity, we distinguish
carefully here between a game (a board, markers, and a set of rules)
and a contest (one complete experience at a game, from an initially
empty board to some state where the rules terminate play). We refer to
the predrawn straight lines visible in Figure 2 simply as lines. The
intersection of two or more lines is a position. A position without a
marker on it is said to be empty. Although the program draws pictures
like those in Figure 2 for output, the internal, computational
representation of any game board is a linear list of position values
(e.g., black or white or blank) along with the identity of the mover
and whether the contest is in the placing or sliding stage. The program
also makes obvious representational transformations to and from a
two-dimensional array to normalize computations for symmetry, but the
array has no meaningful role in move selection. The game definition
includes a list of predrawn lines and the positions on them.

A morris game has two contestants, black and white, each with an equal
number of markers. A morris contest has two stages: a placing stage,
where initially the board is empty, and the contestants alternate
placing one of their markers on any empty position, and a sliding
stage, where a turn consists of sliding one’s marker along any line
drawn on the game board to an immediately adjacent empty position. A
marker may not jump over another marker or be lifted from the board
during a slide. Three markers of the same color on immediately adjacent
positions on a line form a mill. Each time a contestant constructs a
mill, she captures (removes) one of the other contestant’s markers that
is not in a mill. Only if the other contestant’s markers are all in
mills does she capture one from a mill. (There are local variations
that permit capture only during the sliding stage, permit hopping
rather than sliding when a contestant is reduced to three near a
contest’s end, and so on.) The first contestant reduced to two markers,
or unable to move, loses.

2.1 The Coverage algorithm

When a marker is placed on any position on a line, it is said to affect
all the positions on that line, including its own. The coverage of a
position is the multiset of all distinct positions that it affects. A
marker positioned where two lines meet induces two copies of its
position. Thus the coverage of 3 in Figure 2(a) is
$\{1, 2, 2 \cdot 3, 10, 16\}$. A set of markers belonging to a single
contestant $P$ produces a cover, a multiset denoted
$C_P = \{c_1 \cdot v_1, c_2 \cdot v_2, \ldots, c_n \cdot v_n\}$ that
lists the affected positions $v_1, v_2, \ldots, v_n$ and the number of
lines $c_i$ on which $v_i$ lies that are affected by one of $P$’s
markers. In Figure 2(a), the white cover is
$C_W = \{2 \cdot 1, 2, 3, 2 \cdot 4, 5, 2 \cdot 6, 2 \cdot 7, 2 \cdot 8, 9, 11, 13, 14\}$.
The cover difference $C \sim D$ for
$C = \{c_1 \cdot v_1, c_2 \cdot v_2, \ldots, c_n \cdot v_n\}$ and
$D = \{d_1 \cdot w_1, d_2 \cdot w_2, \ldots, d_m \cdot w_m\}$ is
defined to be the multiset
$C \sim D = \{x \cdot y \mid y = v_i \text{ for some } i = 1, 2, \ldots, n;\ x \cdot y \in C;\ y \neq w_j \text{ for any } j = 1, 2, \ldots, m\}$.
In Figure 2(a), $C_B \sim C_W = \{10, 12, 15, 2 \cdot 16\}$ and
$C_W \sim C_B = \emptyset$. We take the standard definitions from graph
theory for adjacency, path, and path length.
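The definitions above can be made concrete with a small sketch. The line list `LINES` is an assumption: it is the five men’s morris layout inferred from the coverage of position 3 and from the legal moves quoted for Figure 2, not a transcription of the figure. Multiplicities are obtained here by summing the coverage of each marker, a reading that reproduces the covers quoted in the text; the function names are hypothetical.

```python
from collections import Counter

# Assumed five men's morris lines (numbering inferred from the text).
LINES = [
    (1, 2, 3), (4, 5, 6), (11, 12, 13), (14, 15, 16),  # horizontal sides
    (1, 7, 14), (4, 8, 11), (6, 9, 13), (3, 10, 16),   # vertical sides
    (2, 5), (7, 8), (9, 10), (12, 15),                 # short connectors
]

def coverage(position, lines=LINES):
    """Multiset of positions affected by a marker on `position`."""
    affected = Counter()
    for line in lines:
        if position in line:
            affected.update(line)  # the marker's own position counts per line
    return affected

def cover(markers, lines=LINES):
    """Cover of one contestant: the multiset sum of its markers' coverages."""
    total = Counter()
    for marker in markers:
        total += coverage(marker, lines)
    return total

def cover_difference(c, d):
    """C ~ D: entries of C whose position does not occur in D at all."""
    return Counter({v: n for v, n in c.items() if v not in d})

print(sorted(coverage(3).items()))  # position 3 appears twice in its coverage
```

With white markers assumed on 1, 6, and 8, `cover([1, 6, 8])` yields the white cover quoted for Figure 2(a).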

A marker offensively offers the potential to group others along lines
it lies on (juxtaposition) and to facilitate movement there (mobility),
while it defensively obstructs the opposition’s ability to do the same.
The Coverage algorithm attempts to spread its markers over as many
lines as possible, particularly lines already covered by the other
contestant, and tries to do so on positions with maximal coverage.
Assume, without loss of generality, that it is white’s turn to move. In
the placing stage, Coverage recommends a move to every empty position
$c_i \cdot v_i \in C_B \sim C_W$ where $c_i > 1$. If there are no such
positions, it recommends a move to every position in $C_B \sim C_W$
with maximal coverage. If there are no such positions of either kind,
it recommends a move to every empty position with maximal coverage. In
Figure 2(a) with White to move in the placing stage,
$C_B \sim C_W = \{10, 12, 15, 2 \cdot 16\}$, so Coverage recommends a
move to 16.

In the sliding stage, Coverage recommends each legal move that
increases $|v_j|$, the number of the mover’s distinct covered
positions. Let (p,q) denote a sliding move from position p to position
q. In Figure 2(b) the legal moves (1,7), (9,6), (9,13), (10,3),
(10,16), (14,7) change $|v_j|$ by -1, +2, 0, 0, 0, -1, respectively, so
Coverage recommends (9,6). In the sliding stage, however, one’s cover
can also decrease. Therefore, Coverage also recommends each legal slide
to a position $c_i \cdot v_i \in C_B$ where $c_i > 1$ but for which
$c_j \leq 1$ in $C_W$. In Figure 2(c), where
$C_B = \{2 \cdot 1, 2 \cdot 2, 2 \cdot 3, 2 \cdot 4, 2 \cdot 5, 6, 7, 8, 10, 3 \cdot 11, 3 \cdot 12, 2 \cdot 13, 2 \cdot 14, 2 \cdot 15, 2 \cdot 16\}$,
$C_W = \{2 \cdot 1, 2 \cdot 2, 2 \cdot 3, 2 \cdot 4, 2 \cdot 5, 2 \cdot 6, 2 \cdot 7, 2 \cdot 8, 2 \cdot 9, 2 \cdot 10, 11, 13, 2 \cdot 14, 15, 2 \cdot 16\}$,
and the legal moves are (2,3), (6,9), (8,4), (8,7), (10,3), (10,9),
(14,7), (14,15), those vertices are 11, 12, 13, 15, so Coverage can
only recommend (14,15).

Figure 3: A placing state in nine men’s morris, white to move.
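The placing-stage and sliding-stage rules of Coverage can be sketched as follows. Several things here are assumptions rather than facts from the paper: the line list is the five men’s morris layout inferred from the moves quoted for Figure 2, "maximal coverage" is ranked by the size of a position’s coverage multiset, legal moves are supplied by the caller, and all function names are hypothetical.

```python
from collections import Counter

LINES = [  # assumed five men's morris lines
    (1, 2, 3), (4, 5, 6), (11, 12, 13), (14, 15, 16),
    (1, 7, 14), (4, 8, 11), (6, 9, 13), (3, 10, 16),
    (2, 5), (7, 8), (9, 10), (12, 15),
]
POSITIONS = sorted({p for line in LINES for p in line})

def cover(markers):
    """Multiset sum of the coverages of a contestant's markers."""
    total = Counter()
    for marker in markers:
        for line in LINES:
            if marker in line:
                total.update(line)
    return total

def coverage_size(position):
    """Size of a position's coverage multiset (assumed ranking measure)."""
    return sum(len(line) for line in LINES if position in line)

def recommend_placing(mover, opponent):
    occupied = set(mover) | set(opponent)
    c_m, c_o = cover(mover), cover(opponent)
    # Empty positions covered by the opponent but not by the mover.
    diff = {v: n for v, n in c_o.items() if v not in c_m and v not in occupied}
    strong = sorted(v for v, n in diff.items() if n > 1)
    if strong:
        return strong
    pool = sorted(diff) or [p for p in POSITIONS if p not in occupied]
    if not pool:
        return []
    best = max(coverage_size(p) for p in pool)
    return [p for p in pool if coverage_size(p) == best]

def recommend_sliding(mover, opponent, legal_moves):
    c_m, c_o = cover(mover), cover(opponent)
    # Positions the opponent covers more than once and the mover at most once.
    contested = {v for v, n in c_o.items() if n > 1 and c_m[v] <= 1}
    picks = []
    for p, q in legal_moves:
        after = cover([q if m == p else m for m in mover])
        if len(after) > len(c_m) or q in contested:
            picks.append((p, q))
    return picks
```

Under the marker placements inferred from the covers quoted for Figure 2(c) (white on 2, 6, 8, 10, 14; black on 1, 5, 11, 12, 16), `recommend_sliding` with the eight listed legal moves returns only (14,15), in agreement with the text.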

2.2 The Shortcut algorithm

The Shortcut algorithm addresses long-range ability to move, and does
so without forward search into the game graph. The algorithm for
Shortcut begins by calculating the non-zero path lengths between pairs
of same-color markers, including that from a marker to itself. For
example, in Figure 3 the shortest paths between the white markers on 2
and 20 are [2, 5, 6, 14, 21, 20], [2, 3, 15, 14, 21, 20], and [2, 5, 4,
11, 19, 20]. Next, the algorithm selects those pairs for which the
shortest non-zero length path between them is a minimum. It then
retains only those shortest paths that meet the following criteria:
every empty position lies on some line without a marker of the opposite
color, and at least one position on the path lies at the intersection
of two such lines. All three paths identified for Figure 3 are retained
because of positions 5, 14, and 5, respectively. Shortcut recommends a
placing or sliding move to the middlemost point(s) of each such path.
In Figure 3, Shortcut therefore recommends moves to the midpoints 6 and
14, 15 and 14, and 4 and 11. Computation for this algorithm, styled as
spreading activation, is very fast.
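A rough sketch of the path search, retention test, and midpoint selection is given below. It is not the spreading-activation implementation the text mentions; it uses a plain breadth-first search instead. The line list assumes the usual nine men’s morris layout numbered row by row, which agrees with the three paths quoted above. Further assumptions: interior path positions must be empty, paths from a marker to itself are omitted, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

LINES = [  # assumed nine men's morris lines, numbered row by row
    (1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15),
    (16, 17, 18), (19, 20, 21), (22, 23, 24),
    (1, 10, 22), (4, 11, 19), (7, 12, 16), (2, 5, 8), (17, 20, 23),
    (9, 13, 18), (6, 14, 21), (3, 15, 24),
]
ADJ = {}
for _line in LINES:
    for _a, _b in zip(_line, _line[1:]):
        ADJ.setdefault(_a, set()).add(_b)
        ADJ.setdefault(_b, set()).add(_a)

def shortest_paths(src, dst, occupied):
    """All shortest src-dst paths whose interior positions are empty."""
    dist, parents, queue = {src: 0}, {src: []}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            continue
        for v in sorted(ADJ[u]):
            if v in occupied and v != dst:
                continue
            if v not in dist:
                dist[v], parents[v] = dist[u] + 1, [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                parents[v].append(u)
    if dst not in dist:
        return []
    def unwind(v):
        if v == src:
            return [[src]]
        return [path + [v] for u in parents[v] for path in unwind(u)]
    return unwind(dst)

def retained(path, opponent):
    """Retention test: lines free of opposing markers count as clear."""
    clear = [set(line) for line in LINES if not set(line) & opponent]
    n_clear = lambda p: sum(p in line for line in clear)
    return (all(n_clear(p) >= 1 for p in path[1:-1])
            and any(n_clear(p) == 2 for p in path))

def shortcut(own, opponent):
    occupied = set(own) | set(opponent)
    found = [shortest_paths(a, b, occupied)
             for a, b in combinations(sorted(own), 2)]
    lengths = [len(paths[0]) for paths in found if paths]
    if not lengths:
        return set()
    best, targets = min(lengths), set()
    for paths in found:
        for path in paths:
            if len(path) == best and retained(path, set(opponent)):
                n = len(path)
                targets.update({path[(n - 1) // 2], path[n // 2]})
    return targets - occupied
```

For example, with white on 2 and 20 and black markers hypothetically placed on 10 and 23, the search finds exactly the three paths listed in the text, and `shortcut({2, 20}, {10, 23})` recommends positions 4, 6, 11, 14, and 15.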

2.3 Results with Coverage and Shortcut

Prior to Coverage, Hoyle never played five men’s morris very well.
There are approximately 9 million possible board positions in five
men’s morris, with an average branch factor of about 6. After 500
learning contests Hoyle was still losing roughly 85% of the time. Once
Coverage was added, however, Hoyle’s decisions improved markedly.
(Shortcut was not part of this experiment; the data average results
across five runs.) With Coverage, Hoyle played better faster; after
32.75 contests it had learned well enough to draw 10 in a row. The
contests averaged 33 moves, so that the program was exposed during
learning to at most 1070.5 different states, about .012% of the search
space. From that experience, the program was judged to simulate expert
play while explicitly retaining data on only about .006% of the states
in the game graph.

In post-learning testing, Hoyle proved to be a reliable, if imperfect,
expert at five men’s morris. When the program played 20 additional
contests against the model with learning turned off, it lost 2.25 of
them. Thus Hoyle after learning is 88.75% reliable at five men’s
morris, still a strong performance after such limited experience and
with such limited retention in so large a search space. Additional
testing displayed increasing prowess against decreasingly skilled
opposition, an argument that expertise is indeed being simulated.

With a search space about 16,000 times larger than that of five men’s,
nine men’s morris is a more strenuous test of Hoyle’s ability to learn
to play well. Because there is no definition of expert outcome for this
game, we chose simply to let the program play 50 contests against the
model. Without Coverage and Shortcut, Hoyle lost every contest. With
them both, however, there was a dramatic improvement. Inspection showed
that the program played as well as a human expert in the placing stage
of the last 10 contests. During those 50 contests, which averaged 60
moves each, it lost 24 times, drew 17 times, and won nine times. (Some
minor corrections to the model are now underway.) The first of those
wins was on the 27th contest, and four of them were in the last six
contests, suggesting that Hoyle was learning to play better. With the
addition of less than 200 lines of game-independent code for the two
new visually-cued Advisors, Hoyle was able to learn to outperform
expert system code that was more than 11 times its length and
restricted to a single game. The morris family includes versions for 6,
9, 11, and 12 men, with different predrawn lines. At this writing,
Hoyle is learning them all quickly.

It should be noted that neither of these Advisors applies useful
knowledge; instead, they direct the learning program’s experience to
the parts of the game graph where the key information lies,
highly-selective knowledge that distinguishes an expert from a novice
(Ericsson and Smith, 1991). If this knowledge is concisely located, as
it appears to be in the morris games, and the learner can harness it,
as Hoyle’s learning algorithms do, the program learns to play quickly
and well. As detailed here, this general improvement comes at a mere
fraction of the development time for a traditional game-specific expert
system.

3.  Learning  Patterns 

Hoyle is a limitedly rational system that deliberately avoids
exhaustive search and complete storage of its experience. Consistent
with this approach, the work described here retains only a small number
of the patterns encountered during play, ones with strong empirical
evidence of their significance. The program uses a
heuristically-organized database to associate small geometrical
arrangements of markers on the board with winning and losing. The
associative, hierarchical pattern database is a new item of useful
knowledge. The first level of the database contains states; the second
level contains patterns.

The pattern database is constructed by the pattern classifier, an
associated learning algorithm, as follows. At the end of each contest,
every state that occurred during the contest is cached in a fixed-size
hash table, noting the sequence number of the most recent contest in
which it appeared and whether Hoyle won, lost, or drew there. Each new
state in the pattern database is now matched against nine templates for
a 3x3 grid, adjusted for symmetry and shown in Figure 4. A “?” in a
template represents an X, an O, or an empty space;  is the don’t care
symbol. A subpattern is an instantiation of a template, e.g., X’s in
the corners of a diagonal. (Preliminary empirical tests showed this to
be the smallest set of effective templates.)

The second level of the pattern database consists of those subpatterns
which appear in at least two states of the first level. Most states
match several ways and therefore make multiple contributions to
counting on the second level. Each subpattern also records the number
of contests in which it participated in a win, a loss, and a draw. Thus
a subpattern is a generalization over a class of states: those that
have recently occurred with some frequency and contain simple
configurations of pieces. Each subpattern is categorized as winning,
drawing or losing based upon which kind of contest it appeared in most
frequently.

Figure 4. The set of templates used by the pattern classifier.

It is important to forget in the pattern database, primarily to
discount novice-like play during the early learning of a game. There
will be winning contests, and patterns associated with them, that were
due to the learner’s early errors. We have therefore implemented two
ways to forget in the pattern database. First, when a hash table for
either states or patterns is full, and a new entry should be made, the
least recently used entry is eliminated, based on its most recent
contest number. Second, at the end of every contest, the number of
times each state was encountered is multiplied by 0.9.
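The two forgetting mechanisms can be illustrated with a small table class. This is a sketch under stated assumptions: the class name `ForgetfulTable`, its methods, and the use of a Python dictionary in place of the fixed-size hash table are all hypothetical; only the eviction rule (oldest most-recent contest number) and the decay factor 0.9 come from the text.

```python
class ForgetfulTable:
    """Bounded table with least-recently-used eviction and count decay."""

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity
        self.decay = decay
        self.entries = {}  # key -> [encounter count, most recent contest]

    def record(self, key, contest):
        """Note that `key` was encountered during contest number `contest`."""
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry whose most recent contest number is oldest.
            oldest = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[oldest]
        entry = self.entries.setdefault(key, [0.0, contest])
        entry[0] += 1.0
        entry[1] = contest

    def end_contest(self):
        """Discount every encounter count at the end of a contest."""
        for entry in self.entries.values():
            entry[0] *= self.decay
```

With the default decay, a state seen once and never again contributes 0.9, then 0.81, and so on, so early novice-like play gradually loses influence even before its entry is evicted.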

Patsy is an Advisor that ranks legal next moves based on their fit with
the pattern database. Patsy looks at the set of possible next states
resulting from the current legal moves. Each next state is compared
with the subpattern level of the database. A matched winning subpattern
awards the state a +2, a matched drawing subpattern a +1, and a matched
losing subpattern a -2. A state’s score is the total of its subpattern
values divided by the number of subpatterns in the cache. Patsy
recommends the move whose next state has the highest such score. Ties
are broken by random selection among the best moves.
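Patsy’s scoring rule can be sketched as below. The data shapes are assumptions made for illustration: `next_states` maps each legal move to its resulting state, `matches` is a caller-supplied function returning the subpatterns a state instantiates, and `database` maps each cached subpattern to its category. None of these names come from Hoyle.

```python
import random

# Values awarded per matched subpattern, as stated in the text.
VALUES = {"win": 2, "draw": 1, "loss": -2}

def patsy(next_states, matches, database, rng=random):
    """Return the move whose next state scores highest, or None."""
    if not next_states or not database:
        return None
    scores = {}
    for move, state in next_states.items():
        total = sum(VALUES[database[s]] for s in matches(state) if s in database)
        # Normalize by the number of subpatterns currently in the cache.
        scores[move] = total / len(database)
    best = max(scores.values())
    # Ties are broken by random selection among the best moves.
    return rng.choice([m for m, s in scores.items() if s == best])
```

For instance, with three cached subpatterns, a next state matching one winning and one drawing subpattern scores (2 + 1) / 3 = 1, while a state matching a single losing subpattern scores -2 / 3.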


Figure  5.  The  performance  of  Hoylite  with 
and  without  Patsy. 

Patsy was tested within a severely pared-down version of Hoyle, called
Hoylite here. Hoylite has only two of Hoyle’s original Advisors, plus
Patsy. The pattern classifier forms categories based on observed game
states and associates responses to the observed states by learning
during play. The hash table sizes were limited to 50 game states and 30
subpatterns. Three tournaments between Hoylite and a perfect
tic-tac-toe player were run to assess the performance of Hoylite. The
perfect player was a look-up table of correct moves. Each tournament
was continued for 50 contests. The average cumulative number of
Hoylite’s wins and draws is plotted against contest number in Figure 5.
The graph compares Hoylite’s average performance against the perfect
contestant with and without Patsy. Clearly Hoylite performs
consistently better with Patsy.


There are many games that are played on a 3x3 grid. At this writing we
are testing whether the same pattern templates in Figure 4 apply to
several other games. We are also gradually adding Hoyle’s Advisors to
Hoylite, to see what conflicts, if any, arise. Finally, we are
experimenting with more sophisticated pattern classifiers, ones that
model the response of the human eye to arrangements such as lines of
pieces and lines of open spaces.

4.  Discussion 

Predrawn game board lines are shown here to be important, readily
accessible regularities that support better playing decisions.
Historical data on patterns attractive to the human eye are
demonstrably helpful in distinguishing good middlegame positions from
mediocre ones. The brevity of the code required to capitalize on these
visual cues for a variety of problems argues for the limitedly rational
perspective of the architecture. The improvement the new Advisors have
on play argues for the significance of visual representations as an
integral part of decision making. When predrawn board lines are taken
as visual cues for juxtaposition and mobility, Hoyle learns to play
challenging games faster and better. Coverage and Shortcut in no way
diminish the program’s ability to learn and play the broad variety of
games at which it had previously excelled (Epstein, 1992).

Our preliminary examination of the impact of a recognition-association
competitive learning pattern classifier on several other expert
knowledge sources and learning methods via a blackboard architecture is
promising. The game played was a simple one, and only two of the 22
preexisting Advisors were included. A simple game was chosen to
facilitate debugging the pattern classifier and measuring performance
against an absolute standard. More than two Advisors would have
obscured the contribution of the pattern-associative component.
Hoylite’s pattern classifier is quite simple and does not learn new
templates; it only learns which game states are important for the given
set of templates. It can be seen from these preliminary results that a
pattern recognition component can be smoothly integrated into a game
playing system that involves reasoning and limited search.

Heuristic Advisors are needed most in the middlegame, where the large
number of possible moves precludes search. It has been our experience
with more complex games, where one would have many Advisors, that
openings are typically memorized, and that the endgame can be
well-played with Advisors that reason about known losing and winning
positions. Inspection reveals that Shortcut and Coverage contribute to
decisions only in the middlegame, while Patsy works on the opening and
middlegame. In the full version of Hoyle, other Advisors cover the
opening, and an experience-driven partial retrograde analysis learns
enough useful knowledge to tune the endgame. In Hoylite, the other two
Advisors, Victory and Panic, address the endgame, leaving Patsy to
consider patterns important at the earlier stages. All three new
Advisors prove to filter the middlegame alternatives to a few likely
moves, ones that might then benefit from limited search.

Abstract

Analyzing data with the intent of inducing classification rules
typically proceeds from a set of training data in which classifications
are known. In the event classifications are unknown, algorithms exist
for performing unsupervised learning to determine concept classes
inherent in the data. In this paper, we describe experiments applying
multiple learning strategies for classifying unlabeled data.
Specifically, three unsupervised learning algorithms were applied to a
large set of public health data in order to determine likely concept
classes for the data based on the inherent features in the data. After
inducing the concept classes, the data were processed by a decision
tree algorithm in order to determine more efficient classification
rules under the assumption that the concepts induced during
unsupervised learning were correct.

Key words: Classification, unsupervised learning, clustering, decision
trees

1.  Introduction 

The machine learning literature describes several approaches for
classifying numerical data. For example, decision trees (such as those
generated by Quinlan’s ID3 and C4 algorithms) select attributes as
internal test nodes of a tree to determine the class to which a data
point belongs, given at the leaves of the tree (Quinlan, 1986). Nearest
neighbor algorithms store training examples paired with a
classification (Aha et al., 1991). When a new point is presented, the
stored point that is closest in some sense (such as Euclidean distance
or Hamming distance) is selected and the corresponding classification
reported.
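As a minimal illustration of the nearest neighbor idea, the sketch below stores labeled examples and reports the label of the closest stored point under Hamming distance. The function names and the example data are hypothetical and are not drawn from the study.

```python
def hamming(a, b):
    """Number of attribute positions at which two points differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor(examples, query, distance=hamming):
    """Return the label stored with the example closest to `query`.

    examples: list of (point, label) pairs.
    """
    _, label = min(examples, key=lambda ex: distance(ex[0], query))
    return label

stored = [((0, 0, 1), "A"), ((1, 1, 0), "B")]
print(nearest_neighbor(stored, (1, 1, 1)))  # the second example is closer
```

Any other distance function with the same two-argument form, such as Euclidean distance, can be passed in place of `hamming`.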

At times, labels providing classification information are not available
with the training set. In these instances, unsupervised learning
approaches may be employed to detect clusters of the data. These
clusters can then be used to develop an initial set of classification
labels (albeit non-symbolic) for the data.

In this paper, we will describe applying multiple learning strategies
to a large set of psychiatric data (Eaton and Ritter, 1988; Eaton et
al., 1989). Specifically, we will compare three clustering algorithms
and discuss the results of processing resultant clusters with a
decision tree algorithm to provide an efficient classification
strategy.

The psychiatric data, provided by the Johns Hopkins School of Public
Health, consists of over 7,000 data points describing patients with
respect to clinical depression or anxiety. Each data point has 58
fields indicating, for example, whether a patient has various fears,
feelings of worthlessness, thoughts of suicide, etc. Our experiments
used 20 binary attributes from the 58 provided. According to the School
of Public Health, these 20 attributes characterize depression where the
others provide demographic information and characterize anxiety. Note
that none of the data, as of yet, have been classified (i.e., labels
are not known a priori), thus motivating the analysis of unsupervised
learning techniques.

The three clustering algorithms examined include a non-hierarchical
approach, a hierarchical approach (thus resulting in a decision tree),
and a connectionist approach. The non-hierarchical approach is based on
a variation of MacQueen’s k-means method (MacQueen, 1967). The standard
k-means method assumes k clusters and fits the data in the clusters
with the nearest centroids. The variation of this method used permits k
to vary so that an estimate of the number of classes in the data may be
determined.

The second cluster analysis approach is hierarchical. Hierarchical
approaches either divide data or combine data in a tree structure.
Divisive approaches begin with one large cluster and divide into
smaller clusters based on the attributes. Agglomerative approaches
begin with one cluster for each training sample and combine clusters
based on similarity. The approach used in this part of the study is a
divisive approach called association analysis. This approach selects an
attribute to divide clusters by computing a matrix of chi-square
coefficients for each attribute and selecting the coefficient with the
maximum sum of chi-square values (Everitt, 1974).
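The attribute selection step can be sketched for binary data as follows. The text does not give the formula, so the standard chi-square statistic for a 2x2 contingency table is assumed here, and the function names are hypothetical.

```python
def chi_square(data, j, k):
    """Chi-square statistic for the 2x2 table of binary attributes j and k."""
    n = len(data)
    a = sum(1 for row in data if row[j] and row[k])
    b = sum(1 for row in data if row[j] and not row[k])
    c = sum(1 for row in data if not row[j] and row[k])
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A constant attribute gives a degenerate table; treat it as no association.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split_attribute(data):
    """Index of the attribute with the largest summed chi-square value."""
    m = len(data[0])
    sums = [sum(chi_square(data, j, k) for k in range(m) if k != j)
            for j in range(m)]
    return max(range(m), key=sums.__getitem__)

def divide(data):
    """Split a cluster in two on the selected attribute."""
    j = best_split_attribute(data)
    return ([row for row in data if row[j]],
            [row for row in data if not row[j]])
```

Applying `divide` recursively to each resulting cluster, with some stopping rule, yields the decision tree that the hierarchical approach produces.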

Finally, Rumelhart and Zipser (1986) describe a connectionist approach
to clustering using competitive learning. The approach proceeds under
the assumption that dominant attributes will generally determine the
classification, and the network reinforces detection of the dominant
attributes by strengthening weights associated with their corresponding
input nodes. The output layer then applies a winner-take-all
competition strategy to determine the cluster to which a data point
belongs.

Since the experimental data used was not provided with classification
labels, the second phase of the study consists of generating decision
trees based on the classifications derived from the clustering
techniques. Quinlan’s C4 algorithm is applied to the results of all
three clustering techniques, and the resulting trees compared to rules
that can be derived from the clustering algorithms themselves.

2.  Inducing  Concept  Classes  Using 
Unsupervised  Learning 

There are many ways to characterize machine learning algorithms. One
approach is based upon whether or not an external “teacher” exists. The
two resulting types of learning algorithms are referred to as
supervised learning and unsupervised learning. Typically, supervised
learning proceeds when the results of some action are analyzed by a
critic in comparison with known or expected results. Discrepancies
between the two are used to determine ways to modify internal
representations of the data so as to improve performance.

Unsupervised learning, on the other hand, does not have the advantage
of an external teacher to determine “appropriate” behavior or “correct”
classifications. Rather, data are examined and organized in such a way
as to identify internal consistency. The class of cluster analysis
algorithms generally falls within the set of unsupervised learning
algorithms. In the following sections, we will describe the details of
the three unsupervised learning algorithms used in this study.

2.1 Clustering by k-means

The first technique for clustering fits within the class of
non-hierarchical techniques. Non-hierarchical clustering begins by
selecting an initial set of clusters and alters the partitions so as to
improve some metric. For example, nearest centroid methods attempt to
develop partitions such that classification is made by comparing a
point to the centroids of the clusters. The class corresponding to the
nearest centroid is the one identified for that data point.

One  of  the  most  common  approaches  to 
non-hierarchical  clustering  is  MacQueen’s  k- 
means  algorithm  (MacQueen,  1967).  The  k- 
means  algorithm  attempts  to  determine  the 
k  best  clusters  for  a  set  of  data  such  that 
classification  is  made  by  finding  the  cluster 
with  the  nearest  Euclidean  distance.  Recall 
that  the  Euclidean  distance  between  two 
points  is  oimputed  as  follows: 


The  basic  ilr-means  algorithm  consists  of  the 
following  steps: 

1.  Select  the  first  ir  data  points  as  initial 
dusters  with  one  member  in  each 
duster. 

2.  Assign  the  remaining  m  -  k  data 
points  to  the  duster  with  the  nearest 
centroid. 

3.  After  assigning  each  point, 
recompute  the  corresponding 
centroid  of  the  duster  with  the  new 
point. 

4.  After  all  of  the  data  points  have 
been  assigned,  use  the  ii  dusters  as 
seed  points  and  pass  through  the 
data  one  more  time  for  a  final 
dassification. 

Variations  of  this  algorithm  exist  in  which 
the  dusters  converge  to  improved  dusters. 
These  variants  require  several  passes 
through  the  data,  but  the  law  of  diminishing 
returns  may  be  experienced  fairly  early  in 
the  process. 

Unfortunately,  for  our  purposes,  the  basic  k- 
means  algorithm  has  a  more  serious 
drawback.  This  algorithm  assumes  the 
number  of  dusters  is  known  and  force  fits 
all  of  the  data  into  exactly  k  dusters.  For 
this  reason,  MacQueen  also  proposed  a 
variant  in  which  the  number  of  dusters  is 
not  known.  This  algorithm  is  the  one 
selected  for  this  study  and  is  composed  of 
the  following  steps: 


dist(j>l,p2) 


N 


where  xJt  is  the  ith  attribute  of  point  pi  and 
x2i  is  the  ith  attribute  of  p)oint  p2.  Since  all 
of  the  attributes  in  the  data  set  are  binary, 
distance  reduces  to  the  square  root  of  the 
Hamming  distance. 


1. Select values for an initial k and two additional parameters, C
(coarsening) and R (refining).

2. As in the basic k-means algorithm, select the first k data points
as the initial clusters.

3. Compute all of the pairwise distances between each of the cluster
centroids. If the smallest distance is less than C, then merge the two
corresponding clusters and recompute the corresponding centroid.
Continue merging until no other merges occur.

4. Assign the remaining m - k data points one at a time to the cluster
with the nearest centroid. If the distance to the nearest centroid is
greater than R, then consider the point a new cluster and go to step 3.

5. After all of the data points have been assigned, use the cluster
centroids as seed points and pass through the data one last time
assigning the points to the clusters with the nearest centroids.

This algorithm can also follow convergent approaches, and as before,
it has been found that diminishing returns exhibit themselves early in
the process.
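
The five steps of the variant can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the program
used in the study: the function names (`adaptive_kmeans`, `coarsen`,
`nearest`) are hypothetical, merging is re-run only when a point founds
a new cluster (one reading of "go to step 3"), and ties are broken by
taking the first nearest centroid.

```python
import math


def dist(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(members):
    """Component-wise mean of a non-empty list of points."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]


def coarsen(clusters, C):
    """Step 3: merge the two closest clusters while they are closer than C."""
    while len(clusters) > 1:
        pairs = [(dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d >= C:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def nearest(point, centroids):
    """Index of the centroid closest to point (first one on ties)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def adaptive_kmeans(data, k, C, R):
    """Single pass of the variant with coarsening (C) and refining (R)."""
    clusters = [[p] for p in data[:k]]           # step 2
    clusters = coarsen(clusters, C)              # step 3
    for p in data[k:]:                           # step 4
        cents = [centroid(c) for c in clusters]
        best = nearest(p, cents)
        if dist(p, cents[best]) > R:
            clusters.append([p])                 # point founds a new cluster
            clusters = coarsen(clusters, C)      # assumed meaning of "go to step 3"
        else:
            clusters[best].append(p)
    seeds = [centroid(c) for c in clusters]      # step 5
    labels = [nearest(p, seeds) for p in data]
    return seeds, labels
```

Because the number of clusters can grow (via R) or shrink (via C), the
value of k passed in is only a starting point, which is why the final
cluster count can differ from the initial k.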

2.2 An associative clustering algorithm

For the second clustering technique, a hierarchical approach was used.
Hierarchical clustering produces a decision tree by which data points
can be classified according to the determined clusters. In general,
hierarchical clustering is either divisive or agglomerative.
Agglomerative approaches begin with each data point treated as an
individual cluster. Clusters are then combined to form higher level
clusters. This process continues until a group of high level clusters
(or a single cluster) is identified. Divisive approaches begin with a
single cluster and divide the cluster into sub-clusters. This process
continues recursively until base clusters are determined.

In addition, hierarchical approaches can be classified as monothetic
or polythetic. Monothetic techniques attempt to cluster according to
single attributes, whereas polythetic techniques cluster according to
the values of all of the attributes.

The technique used in this part of the study is a monothetic, divisive
cluster analysis algorithm called association analysis (Everitt,
1974). Association analysis divides clusters by selecting the single
attribute that provides the “best” split. The concept of a best split
has been defined in several ways. For example, decision tree
algorithms frequently employ concepts from Shannon’s information
theory to select the attribute that provides the most information
independent of the actual values of the attributes (Shannon, 1948).

Association analysis selects attributes that maximize the chi-square
coefficients of the data. Recall that chi-squared is computed as
follows:

$$\chi^2 = \frac{(n - 1)\, s^2}{\sigma^2}$$

where $s^2$ is the sample variance, $\sigma^2$ is the population
variance, and $n$ is the sample size.

For association analysis, we assume all of the attributes are binary.
The computation of the chi-square coefficients on binary data is
similar to the standard equation. Let $attrib_{ij}$ be the jth
attribute of the ith data point and $attrib_{ik}$ be the kth attribute
of the ith data point. Then

$$a_{jk} = \sum_{\forall i} attrib_{ij} \cdot attrib_{ik}$$

$$b_{jk} = \sum_{\forall i} (1 - attrib_{ij}) \cdot attrib_{ik}$$

$$c_{jk} = \sum_{\forall i} attrib_{ij} \cdot (1 - attrib_{ik})$$

$$d_{jk} = \sum_{\forall i} (1 - attrib_{ij}) \cdot (1 - attrib_{ik})$$

Then the chi-square coefficients are simply computed as

$$\chi^2_{jk} = \frac{(ad - bc)^2\, n}{(a + b)(a + c)(b + d)(c + d)}$$

and the attribute is selected such that these coefficients are
maximized.

2.3 Clustering by competitive learning

For the final clustering technique, a connectionist algorithm was
selected. In particular, the competitive learning neural network
described by Rumelhart and Zipser (1986) was implemented. (Note that
variants on this network are described by von der Malsburg (1973) and
Grossberg (1987).) The idea behind competitive learning is that the
network develops a set of “feature detectors.” When data containing a
learned feature are submitted to the network, then the activity of the
network identifies which feature is present. To identify features,
nodes within the network “compete” among themselves to respond to the
stimulus pattern. The node that wins the competition has the feature
associated with it. Consequently, when that node becomes active, the
feature has been identified.

In order to train a competitive learning network, the weight matrix is
constructed with m rows and n columns, where m = the number of output
nodes and n = the number of input nodes. The weight matrix is
initialized with the following:

$$w_{ji} = \frac{1}{n_I} \pm \delta, \quad \forall i, j$$

where $n_I$ is the number of nodes at the input layer and $\delta$ is a
small random number generated for each weight.

The network is trained by processing a set of training data. Then, for
each output node in the network, and for each training case, a
“winning” node is determined. This winner is used to determine which
node’s weights are to be updated. The winner is determined as follows:

$$\mathrm{winner} = \max_{j} \sum_{i} w_{ji} I_i$$

where $w_{ji}$ is the value in the weight matrix corresponding to row j
and column i, $I_i$ is the activation value of input node i, and j
ranges over the number of outputs.

The competitive learning rule is then applied to the winner for the
given training instance. In other words, the weights in the weight
matrix are only modified for the connections between the input nodes
and the winning output node. The update rule for modifying the weights
in the network is as follows:

[equation not legible in the source]

where $\Delta w_{ji}$ is the change in the weight matrix and $\eta$ is
a learning factor.
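
A compact sketch of the training loop may help fix ideas. Since the
update equation itself did not survive in this copy, the sketch
assumes the standard Rumelhart and Zipser (1986) rule, in which the
winner's weights move toward the normalized input pattern; the rule
actually used in the study may differ. The function names, the
learning factor `eta`, the perturbation size `delta`, and the fixed
random seed are all illustrative choices.

```python
import random


def init_weights(m, n, delta=0.01, seed=0):
    """m x n weight matrix, each entry 1/n plus or minus a small random amount."""
    rng = random.Random(seed)
    return [[1.0 / n + rng.uniform(-delta, delta) for _ in range(n)]
            for _ in range(m)]


def winner(weights, x):
    """Index j of the output node with the largest sum_i w_ji * I_i."""
    acts = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(acts)), key=acts.__getitem__)


def train(weights, data, eta=0.1, epochs=10):
    """Update only the winning node's weights for each training case."""
    for _ in range(epochs):
        for x in data:
            active = sum(x)
            if active == 0:
                continue  # no active inputs, nothing to learn from
            row = weights[winner(weights, x)]
            for i, xi in enumerate(x):
                # Assumed rule: delta_w = eta * (I_i / sum_k I_k - w_ji)
                row[i] += eta * (xi / active - row[i])
    return weights
```

After training, the cluster assigned to a data point is simply
`winner(weights, x)`, and the largest entries in a node's weight row
indicate the attributes that node has come to detect.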

The clusters are identified by winning nodes when a data point is
presented to the network. A further analysis of the network can help
to identify the attributes that are most significant in clustering the
data. In particular, since the update rule “strengthens” the
connections between winning nodes and the significant inputs (i.e.,
attributes), the strong attributes for a given class will have weights
greater than

3.  Inducing  Decision  Trees 

The three clustering algorithms described in the previous sections
provide approaches to labeling data not previously labeled for
classification. Once labels have been assigned, the next step is to
determine efficient and effective means for classifying previously
unseen data according to the concepts learned. One approach for such
concept learning is the induction of decision trees. Perhaps the most
famous decision tree algorithm is ID3 and its successor C4, both
developed by Quinlan (1986).

ID3 and C4 allow attributes to be multi-valued (i.e., they do not
limit attributes to binary values) and construct classification trees
by selecting attributes that provide the best split among the data
according to known classifications. The resulting tree is then used to
classify data including data not used in training. The rules generated
for the decision tree then permit classification to generalize so as
to classify new data. Of course, since we do not know what the correct
classifications are for our experiments, it is difficult to determine
how well the trees generalize. (Note that C4, the program used in
these experiments, automatically constructs trees on a subset of the
training data using ten-way cross-validation and selects a tree that
generalizes the best on the remaining data.)

In order for ID3 and C4 to determine the best attribute at a given
point, Quinlan incorporated the information entropy function described
by Shannon (1948). The information value of a set of data T is

$$I(T) = -\sum_{i=1}^{|C|} \frac{\mathrm{freq}(c_i, T)}{|T|} \log_2 \frac{\mathrm{freq}(c_i, T)}{|T|}$$

where C is the set of classes, T is the set of training instances, and
$\mathrm{freq}(c_i, T)$ is the frequency of class i occurring in T. The
expected information value of $attrib_j$ is

$$E(attrib_j) = \sum_{i=1}^{V_j} \frac{|T_i|}{|T|}\, I(T_i)$$

where $V_j$ is the number of values $attrib_j$ can have and $T_i$ is
the subset of T with $attrib_j$ having the ith value. Then the
information gain is simply $I(T) - E(attrib_j)$. The attribute with the
maximum gain is selected for the root of the current subtree. C4 adds
several techniques for pruning the trees, thus making the final trees
more efficient than the initial ones (Quinlan, 1987). Also, C4 applies
a gain ratio criterion for its splitting criterion, but when all
attributes are binary, the result is identical to applying
information gain.
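
The attribute selection step can be illustrated with a short Python
sketch of the information value, expected information value, and gain
as defined above. It is a minimal illustration rather than Quinlan's
implementation: the function names are hypothetical, and pruning and
the gain ratio criterion are not included.

```python
import math
from collections import Counter


def info(labels):
    """Information value I(T) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def expected_info(rows, labels, j):
    """Expected information E(attrib_j) after splitting on attribute j."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[j], []).append(label)
    n = len(rows)
    return sum(len(s) / n * info(s) for s in subsets.values())


def gain(rows, labels, j):
    """Information gain I(T) - E(attrib_j)."""
    return info(labels) - expected_info(rows, labels, j)


def best_attribute(rows, labels):
    """Index of the attribute with maximum gain (first one on ties)."""
    return max(range(len(rows[0])), key=lambda j: gain(rows, labels, j))
```

For example, with rows `[[0, 0], [0, 1], [1, 0], [1, 1]]` and labels
`["a", "a", "b", "b"]`, the first attribute separates the classes
completely while the second carries no information, so the first is
chosen as the root.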

4.  The  Public  Health  Data 

For this study, psychiatric data on anxiety and depression were
analyzed. This data set consisted of over 7,000 samples with 58 binary
attributes. The data set was collected from the East Baltimore
Epidemiologic Catchment Area (ECA) Program and was supplied by the
Johns Hopkins School of Public Health (Eaton and Ritter, 1988; Eaton
et al., 1989). The data was not categorized prior to analysis, so the
object of the study was to identify regularities within the data that
might suggest natural classifications.

For this study, the data set was reduced in three ways. First, several
of the samples had attributes with unknown values. All samples with
more than five unknown attributes were eliminated from the data set.
Second, since all of the attributes were negative characteristics, all
samples in which all of the attributes were zero were also removed.
This resulted in a data set of approximately 2,000 points. Finally, 20
binary attributes were identified as specifically relevant to
depression. Therefore, all of the clustering algorithms limited
consideration to these 20 attributes. The 20 attributes used in the
study are shown in Table I.
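
The three reduction steps can be written as a simple filter. The
sketch below is illustrative only: it assumes each sample is a
dictionary mapping attribute names to 0, 1, or `None` (for an unknown
value), which is a representation chosen here and not described in the
paper, and it leaves any remaining unknown values untouched because
the paper does not say how they were handled.

```python
def reduce_data(samples, relevant, max_unknown=5):
    """Drop sparse and all-zero samples, then keep only relevant attributes."""
    kept = []
    for s in samples:
        values = list(s.values())
        if sum(v is None for v in values) > max_unknown:
            continue  # step 1: too many unknown attributes
        if all(v == 0 for v in values):
            continue  # step 2: no negative characteristic present
        kept.append({name: s[name] for name in relevant})  # step 3
    return kept
```

Here `relevant` would hold the 20 attribute names of Table I.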

5. Experiments

As mentioned above, the experiments described in this report followed
four major steps. First, k-means clustering (with the described
modification) was applied. Second, the reduced data set was processed
by association analysis to generate a decision tree. Third, the
competitive learning neural network was applied to the data. Finally,
classification labels were assigned to the data points based on the
results for each of the clustering methods and decision trees were
generated by C4. The results of C4 generating decision trees will be
discussed at the end of each relevant section. Unfortunately, space
limitations prevent us from including all of these trees. The
following sections describe the results of the clustering studies.

5.1 K-means clustering

K-means clustering provides a technique for determining clusters
within the data using a principle based on nearest neighbor. As such,
it is not capable of handling overlapping clusters. On the other hand,
it is capable of clustering based on all of the attributes rather than
limiting its view to single attributes (i.e., it is polythetic). Of
course, this makes it more difficult to determine relevant rules for
classification, but we attempt to extract rules from the results of
the analysis.

Recall that this technique requires an initial value for k to be
provided as well as a coarsening and refining parameter. The latter
two parameters were determined empirically, and k was set initially to
10. In particular, the coarsening parameter was set to 0 J and the
refining parameter was set to 1.95. It was found that coarsening was
highly sensitive to values near 1.0 and refining was highly sensitive
near 2.0. K-means was actually applied last, so the parameters were
selected to yield results similar to the other two techniques.

Following k-means clustering on the public health data, 12 clusters
were identified. Attributes of their centroids are listed in Table II.
It was found that the two least similar clusters were Cluster 8 and
Cluster 11. It is believed that these clusters would represent the
extremes on the spectrum of depression. As such, it would be valuable
to decipher the centroids to determine the relevant characteristics.
Cluster 8 showed very low incidence of depression-related attributes
with the exception of increased eating. On the other hand, Cluster 11
showed high incidence of depression-related attributes in all but two
attributes: increased eating and moving all the time.

Table I. Attributes for Public Health Data on Depression and Anxiety

 1.  CONCENT    Trouble concentrating
 2.  CRYING     Crying spells
 3.  DEATHT     Thought about death
 4.  DEATHW     Wanted to die
 5.  EATLESS    Lost appetite
 6.  GAIN2LB    Eating increased
 7.  HOPELESS   Life hopeless
 8.  LOSE2LB    Lost weight
 9.  MOVMORE    Moving all of the time
10.  SAD2WK     Sad for two weeks
11.  SAD2YRS    Sad for two years
12.  SEXDIM     Diminished interest in sex
13.  SLPLESS    Trouble falling asleep
14.  SLPMORE    Sleeping too much
15.  SUIDTRY    Attempted suicide
16.  SUITHINK   Thought of suicide
17.  THINKSLO   Thoughts slower
18.  TIRED      Tired out
19.  TMSLOW     Talked more slowly
20.  WSG2WK     Worthless, sinful, guilty


The attributes at the centroids can be considered as weighted
“presence” of that attribute in determining whether or not a point
belongs to some cluster. These weights spanned 0.1 to 0.9, so a
decision tree generated by C4 will not divide cleanly along the
attributes (as one might expect from a hierarchical clustering
analysis such as the one discussed in the next section). In fact, the
pruned decision tree generated by C4 has 62 paths and a maximum depth
of 13 steps.

It is interesting to note that the top attributes of the C4 tree are
feelings of worthlessness, being sad for two weeks, and thinking
slowly. The first two were also found to be significant in the study
reported in Eaton and Ritter (1988). On the other hand, thoughts of
death (considered to be the most significant attribute in the Eaton et
al. study) appears fairly deep in the tree.

5.2 Chi-square clustering

The result of running the chi-square association analysis on the
public health data was a decision tree that yielded 16 classifications
(Table III). Since the basic goal in classifying this data is to
determine whether or not a patient is depressed, it is apparent that
subcategories may exist within the data. Unfortunately, we are not in
a position to determine the nature of these subcategories without the
basic labeling of
the data.

Table II. Cluster Attributes from K-Means Algorithm.

CLUSTER  HIGHEST ATTRIBUTES          LOWEST ATTRIBUTES

1        TIRED                       SUIDTRY

2        EATLESS                     CONCENT, CRYING, DEATHW, GAIN2LB,
                                     HOPELESS, MOVMORE, SAD2WK, SEXDIM,
                                     SUIDTRY, SUITHINK, THINKSLO, WSG2WK

3        WSG2WK                      EATLESS, GAIN2LB, LOSE2LB, SEXDIM,
                                     SUIDTRY, SUITHINK, THINKSLO

4        HOPELESS                    EATLESS, GAIN2LB, LOSE2LB, SAD2WK,
                                     SAD2YRS, SLPMORE, SUIDTRY, THINKSLO,
                                     TMSLOW

5        TIRED                       CRYING, DEATHT, DEATHW, GAIN2LB,
                                     HOPELESS, MOVMORE, SAD2WK, SAD2YRS,
                                     SLPLESS, SUIDTRY, SUITHINK, WSG2WK

6        SLPLESS                     CONCENT, DEATHT, DEATHW, EATLESS,
                                     GAIN2LB, HOPELESS, LOSE2LB, SAD2WK,
                                     SAD2YRS, SEXDIM, SLPMORE, SLPLESS,
                                     SUIDTRY, SUITHINK, TMSLOW, WSG2WK

7        DEATHT                      CONCENT, CRYING, DEATHW, EATLESS,
                                     GAIN2LB, HOPELESS, LOSE2LB, SAD2WK,
                                     SAD2YRS, SEXDIM, SLPMORE, SLPLESS,
                                     SUIDTRY, SUITHINK, TMSLOW, WSG2WK

8        SAD2WK                      DEATHW, GAIN2LB, MOVMORE, SLPMORE,
                                     SUIDTRY, SUITHINK, TMSLOW, WSG2WK

9        GAIN2LB                     CONCENT, CRYING, DEATHW, EATLESS,
                                     HOPELESS, LOSE2LB, SAD2YRS, SUIDTRY,
                                     SUITHINK, THINKSLO, TMSLOW, WSG2WK

10       CONCENT, THINKSLO           MOVMORE, SUIDTRY, SUITHINK

11       DEATHT, DEATHW,             EATLESS, GAIN2LB, SLPLESS, SLPMORE,
         HOPELESS, LOSE2LB,          THINKSLO, TMSLOW
         MOVMORE, SAD2WK,
         SUITHINK, TIRED,
         WSG2WK

12       CONCENT, DEATHT,            GAIN2LB, LOSE2LB
         DEATHW, HOPELESS,
         SAD2WK, SUITHINK,
         THINKSLO

Table III. Decision Rules from Chi-Square Clustering.

CLUSTER  RULE

1        DEATHW=1, CONCENT=1, SUITHINK=1, HOPELESS=1
2        DEATHW=1, CONCENT=1, SUITHINK=1, HOPELESS=0
3        DEATHW=1, CONCENT=1, SUITHINK=0, THINKSLO=1
4        DEATHW=1, CONCENT=1, SUITHINK=0, THINKSLO=0
5        DEATHW=1, CONCENT=0, SUIDTRY=1
6        DEATHW=1, CONCENT=0, SUIDTRY=0, LOSE2LB=1
7        DEATHW=1, CONCENT=0, SUIDTRY=0, LOSE2LB=0, SAD2WK=1
8        DEATHW=1, CONCENT=0, SUIDTRY=0, LOSE2LB=0, SAD2WK=0
9        DEATHW=0, SLPMORE=1
10       DEATHW=0, SLPMORE=0, SUITHINK=1
11       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=1, DEATHT=1
12       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=1, DEATHT=0
13       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=0, SAD2WK=1,
         HOPELESS=1
14       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=0, SAD2WK=1,
         HOPELESS=0
15       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=0, SAD2WK=0,
         THINKSLO=1
16       DEATHW=0, SLPMORE=0, SUITHINK=0, CONCENT=0, SAD2WK=0,
         THINKSLO=0


Perhaps the most interesting observation to be made from this analysis
was determining which of the attributes are considered most
significant in separating the data. Since association analysis is a
hierarchical technique, attributes used near the root of the tree
differentiate between high level clusters, whereas attributes used
near the leaves of the tree differentiate between finer grained
clusters. So the first observation is that the attribute DEATHW (i.e.,
wanting to die) should be highly indicative of whether or not a
patient is depressed, assuming only the two classifications exist and
a single attribute can distinguish the two clusters. Of course, this
assumption may be totally inappropriate. Another plausible
interpretation is that the clusters generated by this technique (and
by the others) represent “degrees” of depression. As such, wanting to
die may suggest more severe depression while the lack of such thoughts
may not completely eliminate depression.
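
To make the choice of a root attribute concrete, the following sketch
computes the binary chi-square coefficient of Section 2.2 from the
counts a, b, c, and d and then picks a splitting attribute. The
selection criterion used here, the largest sum of coefficients against
all other attributes, is an assumption made for illustration; the text
says only that the chi-square coefficients are maximized. The function
names are hypothetical.

```python
def chi_square(data, j, k):
    """Chi-square coefficient between binary attributes j and k."""
    n = len(data)
    a = sum(r[j] * r[k] for r in data)
    b = sum((1 - r[j]) * r[k] for r in data)
    c = sum(r[j] * (1 - r[k]) for r in data)
    d = sum((1 - r[j]) * (1 - r[k]) for r in data)
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # an attribute is constant, so no association is measurable
    return (a * d - b * c) ** 2 * n / denom


def split_attribute(data):
    """Attribute with the largest summed chi-square (assumed criterion)."""
    n_attr = len(data[0])
    scores = [sum(chi_square(data, j, k) for k in range(n_attr) if k != j)
              for j in range(n_attr)]
    return max(range(n_attr), key=scores.__getitem__)
```

Applying `split_attribute` recursively to the two halves of the data
(attribute equal to 1 and attribute equal to 0) would yield a
monothetic, divisive tree of the kind summarized in Table III.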

It is also interesting to note that, in Eaton and Ritter (1988),
classification according to dysphoria (i.e., general depression)
indicates the highest correlation with thoughts of death. Further, two
of the four classes of depression identified indicated dysphoric
symptoms (indicated by thoughts of death) as a leading attribute. The
attribute SAD2WK was also considered highly indicative of dysphoria.
The chi-square analysis performed here also found this attribute to be
significant but nearer the leaves of the tree.

Finally, the results of the chi-square approach were processed by C4
(Table IV). The most important observation that we made from the
resulting tree was that the two trees are very similar but not
identical. One would expect the trees to be similar since the
classifications were initially made with attributes providing
“perfect” splits of the data. The reason for the difference in the
trees lies in the metric used to select an attribute. In the
chi-square approach, the chi-square metric is used to find high level
variation along the lines of the attributes. C4 attempts to select
attributes to build the decision tree under a similar motivation, but
the metric used is information gain. The information gain metric
attempts to evenly split the data into near equal subsets. In fact, we
find that the maximum depth of the chi-square tree is six and the
maximum depth of the C4 tree is five. For 16 classes, optimal depth of
the tree (assuming equal sized clusters) would be four on each branch.
No calculations were conducted to determine expected cost to classify
based on the size of the data set and the path lengths; however, it is
conjectured that C4’s tree will be slightly better.

Table IV. C4 Decision Tree from Chi-Square Clustering.

DEATHW = 0
|   SUITHINK = 1 : CLUSTER 10
|   SUITHINK = 0
|   |   CONCENT = 0
|   |   |   SAD2WK = 0
|   |   |   |   THINKSLO = 0 : CLUSTER 16
|   |   |   |   THINKSLO = 1 : CLUSTER 15
|   |   |   SAD2WK = 1
|   |   |   |   HOPELESS = 0 : CLUSTER 14
|   |   |   |   HOPELESS = 1 : CLUSTER 13
|   |   CONCENT = 1
|   |   |   DEATHT = 0 : CLUSTER 12
|   |   |   DEATHT = 1 : CLUSTER 11
DEATHW = 1
|   CONCENT = 0
|   |   SUIDTRY = 1 : CLUSTER 5
|   |   SUIDTRY = 0
|   |   |   LOSE2LB = 1 : CLUSTER 6
|   |   |   LOSE2LB = 0
|   |   |   |   SAD2WK = 0 : CLUSTER 8
|   |   |   |   SAD2WK = 1 : CLUSTER 7
|   CONCENT = 1
|   |   SUITHINK = 1 : CLUSTER 1
|   |   SUITHINK = 0
|   |   |   THINKSLO = 0 : CLUSTER 4
|   |   |   THINKSLO = 1 : CLUSTER 3
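
The depth comparison can be stated as a short calculation. A binary
tree that separates 16 equally sized classes needs a depth of at least
$\log_2 16$ on every branch, and the expected cost to classify, which
was not computed in the study, would be the average path length
weighted by how many points reach each leaf. The symbols $n_\ell$ (the
number of data points reaching leaf $\ell$), $d_\ell$ (the depth of
that leaf), $L$ (the number of leaves), and $N$ (the total number of
points) are introduced here for illustration only.

```latex
d_{\min} = \lceil \log_2 16 \rceil = 4,
\qquad
\bar{d} = \sum_{\ell=1}^{L} \frac{n_\ell}{N}\, d_\ell
```

Under this measure a tree with a larger maximum depth can still be
cheaper on average if its deep leaves hold few points, which is why
the comparison between the two trees remains a conjecture without the
leaf counts.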

5.3 Competitive clustering

Finally, the competitive learning algorithm was applied to the public
health data. This approach assumes there will be fewer clusters than
attributes, so since there were 20 attributes, one naturally expects
fewer than 20 clusters. Indeed, competitive learning identified twelve
clusters as in k-means. The results of applying competitive learning
are shown in Table V.

One should observe right away that the results are very similar to the
k-means results. First, the number of clusters is the same. Examining
the attributes that are significant (by examining the values of the
weight matrix) reveals that there are several similar clusters within
the network, and some of these clusters correspond to the k-means
clusters. However, it also appears that the k-means clusters are more
distinct. One possible explanation for this is that the competitive
learning algorithm has difficulty due to its sensitivity to the order
in which the data are presented.

Table V. Cluster Attributes from Competitive Learning Algorithm.

CLUSTER  HIGHEST ATTRIBUTES          LOWEST ATTRIBUTES

1        CONCENT, THINKSLO,          HOPELESS, SUIDTRY, SUITHINK
         TMSLOW

2        DEATHW, SUITHINK            CRYING, EATLESS, SLPLESS, SLPMORE

3        DEATHT                      SUIDTRY

4        HOPELESS                    LOSE2LB

5        GAIN2LB, SLPMORE            CONCENT, CRYING, DEATHW, EATLESS,
                                     SAD2YRS, SUIDTRY

6        EATLESS, LOSE2LB            DEATHW, EATLESS, SUIDTRY

7        SEXDIM                      DEATHW, HOPELESS, LOSE2LB, SAD2YRS,
                                     SUIDTRY, SUITHINK, WSG2WK

8        SAD2WK, WSG2WK              SUIDTRY, SUITHINK

9        CRYING, HOPELESS,           SUIDTRY
         SAD2WK

10       TIRED                       GAIN2LB, SAD2WK, SUIDTRY, SUITHINK

11       MOVMORE                     SUIDTRY

12       SLPLESS, TIRED              HOPELESS, LOSE2LB, SUIDTRY, SUITHINK

The corresponding decision tree generated by C4 is also very complex.
It has 54 paths and a maximum depth of 17; thus its complexity is
analogous to the k-means tree. One significant difference, however, is
the selection of primary attributes (i.e., attributes near the root).
The clusters generated from competitive learning resulted in primary
attributes of sleeplessness, crying, and hopelessness. Only the latter
is one of the significant attributes in Eaton and Ritter (1988). In
fact, the more significant attributes appeared nearer to the leaves in
this tree.


6. Discussion

The results of this study suggest that several degrees of clinical
depression may exist. This is evident by the fact that all three
clustering algorithms identified on the order of 12 to 16 clusters
within the data. Recall that this data was reduced so as to consider
attributes most relevant to depression; however, some carryover from
anxiety is expected to have occurred. Nevertheless, the number of
clusters identified is strong evidence that finer classifications may
exist for depression.

In a previous study applying latent class analysis (Grove et al.,
1987), a reduced set of clusters was assumed. Specifically, this study
assumed two classes. The studies reported in (Eaton and Ritter, 1988;
Eaton et al., 1989) also applied latent class analysis and found four
classes. In a more recent study (Furukawa and Sumita, 1992), a
hierarchical clustering algorithm was applied to a similar data set
and three clusters identified. Unfortunately, the data set used was
extremely small (40 subjects), thus making it difficult to compare
with our results.

For our experiments, we were able to observe the following. First,
both k-means and competitive learning found 12 clusters with similar
attributes. Unfortunately, the “significance” of the attributes for
the two techniques (as evidenced by the C4 decision trees) did not
agree. Second, the association analysis generated 16 clusters by
considering clean partitions of the data along individual attributes.
Now it is unreasonable to assume that all 20 of the attributes are
independent, so the idea that such a clean partitioning can occur
becomes difficult to accept. In fact, many of the classification rules
have combinations of thoughts of death, wanting to die, thinking about
suicide, and attempting suicide. But the other rules seem to suggest
grades of depression when combinations of these attributes (and
others) have conflicting values (e.g., Cluster 8 included wanting to
die but thinking about suicide was notably absent).

Note that both k-means and competitive learning are polythetic
algorithms while chi-square clustering and C4 are monothetic
algorithms. From this it should not be surprising that chi-square
clustering and C4 yield comparable results as do k-means and
competitive learning. It is also understandable, given this
difference, that the C4 trees for k-means and competitive learning
would be much more complex than the C4 tree for chi-square clustering.

Aside from the obvious differences in the trees generated by all three
techniques, these trees also had several similarities. First, the
principal attributes all tended to agree with Eaton and Ritter (1988)
and the trees tended to be highly complex. Further, in post-pruning,
all three trees showed minimal rearrangement. Thus the initial trees
appeared to be near optimal for the training set.

Several additional analyses could be done on this data. First, if the
data were classified, then the classifications could be compared to
the clusters identified to determine if, indeed, degrees or
hierarchies of depression exist. Second, closer examination of the
centroids of the clusters generated by all three techniques may be
useful in determining how similar the tree results really are. For
example, it is possible that the 12 clusters identified by k-means may
closely correlate to the 12 clusters identified by competitive
learning (although the decision trees seem to indicate the opposite).
Unfortunately, time did not permit such a correlation analysis to be
run.
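
One simple form such a comparison could take is a cross-tabulation of
the two sets of cluster labels. The sketch below is offered only as an
illustration of the idea, since no such analysis was run in the study;
the function names and the agreement measure (the fraction of points
that fall in the best-overlapping cluster of the other technique) are
choices made here.

```python
from collections import Counter


def cross_tab(labels_a, labels_b):
    """Count how many points fall in each (cluster in A, cluster in B) pair."""
    return Counter(zip(labels_a, labels_b))


def best_match_agreement(labels_a, labels_b):
    """Fraction of points whose A-cluster maps to its most-overlapping B-cluster."""
    best = {}
    for (a, _b), count in cross_tab(labels_a, labels_b).items():
        if count > best.get(a, 0):
            best[a] = count
    return sum(best.values()) / len(labels_a)
```

A value near 1.0 would indicate that the two techniques found
essentially the same clusters under different numbering, while a low
value would support the disagreement suggested by the decision trees.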

Finally, additional classification algorithms could provide
interesting results. For example, AutoClass by Cheeseman et al. (1988)
is a Bayesian classification tool that attempts to identify the most
probable set of clusters within the data. Running AutoClass on the
data would provide another valuable data point in determining the
character of the depression data.

Another alternative clustering system that we may apply is Fisher’s
COBWEB (1987). COBWEB is an incremental system for hierarchical
conceptual clustering. While our problem does not need to be examined
incrementally, COBWEB offers the advantage of applying a different
utility measure (i.e., category utility) to evaluate generated
clusters. It also constructs the classification tree by using
traditional search operators such as merging and splitting
(corresponding to generalization and specialization respectively).
Finally, since it represents concepts probabilistically, COBWEB should
be better suited to the large data set than more rigid clustering
algorithms such as k-means or chi-square clustering.

Traditional conceptual clustering as introduced by Michalski and Stepp
(1983) and further developed by Stepp and Michalski (1986) relies on
incorporating background knowledge in evaluating the quality of the
resulting clusters. In our problem, little to no background knowledge
was available, so this traditional approach could not be applied
easily. COBWEB’s advantage over CLUSTER/2 (Michalski and Stepp, 1983)
or CLUSTER/S (Stepp and Michalski, 1986) is that the evaluation
function is domain independent. However, we would expect the
availability of domain knowledge to improve classification strategies.

7.  Summary 

In this paper we presented the results of three approaches to
analyzing and clustering a large set of psychiatric data. As a result
of this study, it is apparent that depression cannot be categorized
simply as present or absent. Further, it is unlikely that as few as
three or four classes of depression are sufficient. The results of
this study suggest that there are many degrees of depression ranging
from no depression to severe depression. Further, depending on the
means by which clusters are identified, it is also apparent that a
relatively well defined (although not necessarily small) set of rules
can be derived to assist in classifying a patient as fitting in one of
the categories. These rules may be expressed either in terms of a
decision tree (as in association analysis) or as a linear equation (as
in the neural net). And in each of these cases, additional decision
trees can be constructed which clearly delineate the rules to be
applied for classification.